Update on "[Executorch][llama] Make RoPE freq calculation broadcast for per head"
This is a workaround, possibly not even worth landing, to avoid broadcasting
semantics in the mul op (and, for that matter, in any binary op). The current
implementation of the optimized ops doesn't handle broadcasting and falls back
to the portable op implementation.
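A hypothetical sketch of the idea (not the ExecuTorch C++ code; shapes and numpy are illustrative): materialize the RoPE freqs at the per-head shape up front, so the binary op sees two tensors of identical shape and never needs broadcasting semantics.

```python
import numpy as np

# Illustrative shapes only, not taken from the diff.
n_heads, seq, head_dim = 4, 8, 16
x = np.random.rand(n_heads, seq, head_dim)
freqs = np.random.rand(seq, head_dim)

# Broadcasting path: shapes differ, so an op without broadcast support
# would have to fall back to a portable implementation.
y_broadcast = x * freqs

# Workaround path: expand freqs to the per-head shape first, so the
# mul sees two tensors with exactly matching shapes.
freqs_per_head = np.broadcast_to(freqs, (n_heads, seq, head_dim))
y_same_shape = x * freqs_per_head

# Both paths compute the same result.
assert np.allclose(y_broadcast, y_same_shape)
```

The trade-off is extra memory (or an extra expand step) for the replicated freqs in exchange for staying on the same-shape fast path.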
This diff also fixes an issue where (as seen in llama) the two tensors of a
binary op do not actually require broadcasting, but have a different number of
dims, which still results in invocation of the unoptimized path, e.g.
a = [1, 1, 2048], b = [2048], out = [1, 1, 2048].
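The degenerate case above can be detected with a simple shape check. A minimal sketch (hypothetical helper names, not the actual ExecuTorch check): after stripping leading size-1 dims, the two shapes are identical, so an element-wise kernel can run with no real broadcasting.

```python
def strip_leading_ones(shape):
    """Drop leading size-1 dims, e.g. (1, 1, 2048) -> (2048,)."""
    i = 0
    while i < len(shape) - 1 and shape[i] == 1:
        i += 1
    return tuple(shape[i:])

def can_use_optimized_path(a_shape, b_shape):
    """True when the two shapes are element-wise identical after
    removing leading size-1 dims, so no broadcasting is needed."""
    return strip_leading_ones(a_shape) == strip_leading_ones(b_shape)

# The case from the diff: different number of dims, same element layout.
assert can_use_optimized_path((1, 1, 2048), (2048,))
# Real broadcasting still has to take the fallback path.
assert not can_use_optimized_path((3, 2048), (2048,))
```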
In the llama case this is the optimized path when generating one token at a
time, but not during pre-fill.
Making the optimized op handle broadcasting, and support vectorization, is not
hard, but may take some time.
Differential Revision: [D54766067](https://our.internmc.facebook.com/intern/diff/D54766067/)
[ghstack-poisoned]