
[Executorch][llama] Make RoPE freq calculation broadcast for per head #2353


Closed
wants to merge 13 commits

Conversation

kimishpatel
Contributor

@kimishpatel kimishpatel commented Mar 11, 2024

Stack from ghstack (oldest at bottom):

This is a workaround, and may not even be worth landing, to avoid broadcasting
semantics in the mul op (and, for that matter, any binary op). The current
implementation of the optimized ops doesn't handle broadcasting and falls back
to the portable op implementation.
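
As a rough illustration of the workaround idea (not the exact ExecuTorch code; the tensor names and the [bsz, seq_len, n_heads, head_dim] layout below are assumptions), the RoPE frequency tensors are expanded to the full per-head shape up front, so the subsequent element-wise muls see identically shaped operands and never need broadcasting:

```python
import torch

def apply_rope_no_broadcast(xq, freqs_cos, freqs_sin):
    # xq: [bsz, seq_len, n_heads, head_dim]; freqs_*: [seq_len, head_dim // 2].
    # Split the head dim into (real, imag) pairs for the rotation.
    xq_r, xq_i = xq.float().reshape(*xq.shape[:-1], -1, 2).unbind(-1)

    # Instead of letting the mul op broadcast [seq_len, head_dim/2] freqs over
    # the batch and head dims, expand them explicitly so every operand of the
    # element-wise mul has the same rank and the same shape.
    bsz, seq_len, n_heads, half = xq_r.shape
    freqs_cos = freqs_cos.view(1, seq_len, 1, half).expand(bsz, seq_len, n_heads, half)
    freqs_sin = freqs_sin.view(1, seq_len, 1, half).expand(bsz, seq_len, n_heads, half)

    # Same-shape muls: no broadcasting semantics required in the binary ops.
    out_r = xq_r * freqs_cos - xq_i * freqs_sin
    out_i = xq_r * freqs_sin + xq_i * freqs_cos
    return torch.stack([out_r, out_i], dim=-1).flatten(-2).type_as(xq)
```

For example, with xq of shape [1, 1, 32, 64] and freqs of shape [1, 32], every mul above operates on two [1, 1, 32, 32] tensors.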

This diff also fixes an issue where (as seen in llama) the two tensors of a
binary op do not actually broadcast but have a different number of dims, which
still sends execution down the unoptimized path, e.g. a = [1, 1, 2048],
b = [2048], out = [1, 1, 2048].
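
In this degenerate case no element is actually repeated: the shapes differ only by leading singleton dims. A hypothetical check along these lines (illustrative Python, not the actual C++ op code) is enough to tell that a flat, same-length element-wise loop is valid:

```python
def broadcast_is_trivial(a_shape, b_shape, out_shape):
    """True when the shapes differ only by leading 1s, i.e. all three tensors
    have the same number of elements and can be treated as flat contiguous
    buffers (e.g. a = [1, 1, 2048], b = [2048], out = [1, 1, 2048])."""
    def strip_leading_ones(shape):
        shape = list(shape)
        while len(shape) > 1 and shape[0] == 1:
            shape.pop(0)
        return shape
    return (strip_leading_ones(a_shape)
            == strip_leading_ones(b_shape)
            == strip_leading_ones(out_shape))

assert broadcast_is_trivial([1, 1, 2048], [2048], [1, 1, 2048])      # fast path applies
assert not broadcast_is_trivial([1, 8, 2048], [2048], [1, 8, 2048])  # real broadcast
```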

In the llama case this is the optimized path when generating one token at a
time, but not during pre-fill.

Making the optimized op handle broadcasting, and support vectorization, is not
hard, but may take some time.

Differential Revision: D54766067


pytorch-bot bot commented Mar 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/2353

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d91afc6 with merge base 80e3989:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label Mar 11, 2024
kimishpatel added a commit that referenced this pull request Mar 11, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D54766067

kimishpatel added a commit that referenced this pull request Mar 11, 2024
kimishpatel added a commit that referenced this pull request Mar 12, 2024
kimishpatel added a commit that referenced this pull request Mar 14, 2024
kimishpatel added a commit that referenced this pull request Mar 14, 2024
kimishpatel added a commit that referenced this pull request Mar 15, 2024
kimishpatel added a commit that referenced this pull request Mar 15, 2024
kimishpatel added a commit that referenced this pull request Mar 15, 2024
kimishpatel added a commit that referenced this pull request Mar 16, 2024
@facebook-github-bot
Contributor

This pull request has been merged in 08733f0.

kedarnath03 pushed a commit to kedarnath03/executorch that referenced this pull request Jun 25, 2025
Labels
CLA Signed · fb-exported · Merged
3 participants