
[Executorch][llama] Make RoPE freq calculation broadcast for per head #2353


Closed
wants to merge 13 commits

Conversation

kimishpatel
Contributor

@kimishpatel kimishpatel commented Mar 11, 2024

Stack from ghstack (oldest at bottom):

This is a workaround, and may not even be worth landing, to avoid broadcasting
semantics in the mul op (and, for that matter, any binary op). The current
implementation of the optimized ops doesn't handle broadcasting and falls back
to the portable op implementation.
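
As a rough illustration of the workaround idea (not the exact ExecuTorch code; the tensor names and the [bsz, seq_len, n_heads, head_dim] layout below are assumptions), the RoPE frequency tensors are expanded to the full per-head shape up front, so the subsequent element-wise muls see identically shaped operands and never need broadcasting:

```python
import torch

def apply_rope_no_broadcast(xq, freqs_cos, freqs_sin):
    # xq: [bsz, seq_len, n_heads, head_dim]; freqs_*: [seq_len, head_dim // 2].
    # Split the head dim into (real, imag) pairs for the rotation.
    xq_r, xq_i = xq.float().reshape(*xq.shape[:-1], -1, 2).unbind(-1)

    # Instead of letting the mul op broadcast [seq_len, head_dim/2] freqs over
    # the batch and head dims, expand them explicitly so every operand of the
    # element-wise mul has the same rank and the same shape.
    bsz, seq_len, n_heads, half = xq_r.shape
    freqs_cos = freqs_cos.view(1, seq_len, 1, half).expand(bsz, seq_len, n_heads, half)
    freqs_sin = freqs_sin.view(1, seq_len, 1, half).expand(bsz, seq_len, n_heads, half)

    # Same-shape muls: no broadcasting semantics required in the binary ops.
    out_r = xq_r * freqs_cos - xq_i * freqs_sin
    out_i = xq_r * freqs_sin + xq_i * freqs_cos
    return torch.stack([out_r, out_i], dim=-1).flatten(-2).type_as(xq)
```

For example, with xq of shape [1, 1, 32, 64] and freqs of shape [1, 32], every mul above operates on two [1, 1, 32, 32] tensors.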

This diff also fixes an issue where (as seen in llama) the two tensors of a
binary op do not actually broadcast but have a different number of dims, which
still sends execution down the unoptimized path, e.g. a = [1, 1, 2048],
b = [2048], out = [1, 1, 2048].
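
In this degenerate case no element is actually repeated: the shapes differ only by leading singleton dims. A hypothetical check along these lines (illustrative Python, not the actual C++ op code) is enough to tell that a flat, same-length element-wise loop is valid:

```python
def broadcast_is_trivial(a_shape, b_shape, out_shape):
    """True when the shapes differ only by leading 1s, i.e. all three tensors
    have the same number of elements and can be treated as flat contiguous
    buffers (e.g. a = [1, 1, 2048], b = [2048], out = [1, 1, 2048])."""
    def strip_leading_ones(shape):
        shape = list(shape)
        while len(shape) > 1 and shape[0] == 1:
            shape.pop(0)
        return shape
    return (strip_leading_ones(a_shape)
            == strip_leading_ones(b_shape)
            == strip_leading_ones(out_shape))

assert broadcast_is_trivial([1, 1, 2048], [2048], [1, 1, 2048])      # fast path applies
assert not broadcast_is_trivial([1, 8, 2048], [2048], [1, 8, 2048])  # real broadcast
```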

In the llama case this is the optimized path when generating one token at a
time, but not during pre-fill.

Making the optimized op handle broadcasting, and support vectorization, is not
hard, but may take some time.

Differential Revision: D54766067


pytorch-bot bot commented Mar 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/2353

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d91afc6 with merge base 80e3989:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label Mar 11, 2024
kimishpatel added a commit that referenced this pull request Mar 11, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D54766067

kimishpatel added a commit that referenced this pull request Mar 11, 2024
kimishpatel added a commit that referenced this pull request Mar 12, 2024
kimishpatel added a commit that referenced this pull request Mar 14, 2024
kimishpatel added a commit that referenced this pull request Mar 14, 2024
kimishpatel added a commit that referenced this pull request Mar 15, 2024
kimishpatel added a commit that referenced this pull request Mar 15, 2024
kimishpatel added a commit that referenced this pull request Mar 15, 2024
kimishpatel added a commit that referenced this pull request Mar 16, 2024
@facebook-github-bot
Contributor

This pull request has been merged in 08733f0.

kedarnath03 pushed a commit to kedarnath03/executorch that referenced this pull request Jun 25, 2025
Labels
CLA Signed · fb-exported · Merged
3 participants