
Commit d91afc6

Update on "[Executorch][llama] Make RoPE freq calculation broadcast for per head"
This is a workaround, and may not even be worth landing, to avoid broadcasting semantics in the mul op (and, for that matter, in any binary op). The current implementation of the optimized ops does not handle broadcasting and falls back to the portable op implementation.

This diff also fixes an issue where (as seen in llama) the two tensors of a binary op do not require broadcasting but have different numbers of dims, which results in invocation of the unoptimized path, e.g. a = [1, 1, 2048], b = [2048], out = [1, 1, 2048]. In llama this case takes the optimized path when generating one token at a time, but not during pre-fill.

Making the optimized op handle broadcasting, and support vectorization, is not hard, but may take some time.

Differential Revision: [D54766067](https://our.internmc.facebook.com/intern/diff/D54766067/)

[ghstack-poisoned]
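To make the dim-mismatch case concrete, here is a minimal sketch in Python (not the ExecuTorch kernel code; the helper names and the pre-fill sequence length are hypothetical) of the distinction described above: shapes like [1, 1, 2048] and [2048] are element-wise compatible once leading singleton dims are ignored, so no data actually needs to be repeated.

```python
# Minimal sketch only; not the ExecuTorch implementation.
# Helper names below are hypothetical.

def strip_leading_ones(shape):
    """Drop leading singleton dimensions, e.g. (1, 1, 2048) -> (2048,)."""
    dims = list(shape)
    while len(dims) > 1 and dims[0] == 1:
        dims.pop(0)
    return tuple(dims)


def needs_real_broadcast(shape_a, shape_b):
    """True only when one operand's data must actually be repeated;
    a pure leading-dim mismatch can still use the plain element-wise path."""
    return strip_leading_ones(shape_a) != strip_leading_ones(shape_b)


# Decode (one token at a time): same element count, nothing to repeat,
# so an optimized element-wise kernel could handle it directly.
assert not needs_real_broadcast((1, 1, 2048), (2048,))

# Pre-fill-like case (hypothetical seq_len = 8): the smaller operand must be
# repeated per row, which requires genuine broadcasting support.
assert needs_real_broadcast((1, 8, 2048), (2048,))
```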
2 parents: c7d9d3b + bd22d18

File tree

1 file changed (+1, -1 lines)


examples/models/llama2/export_llama_lib.py

Lines changed: 1 addition & 1 deletion
@@ -22,9 +22,9 @@
 from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
     XnnpackDynamicallyQuantizedPartitioner,
 )
-from executorch.exir.backend.backend_details import CompileSpec
 
 from executorch.examples.models.llama2.llama_transformer import Transformer
+from executorch.exir.backend.backend_details import CompileSpec
 
 from executorch.sdk.etrecord import generate_etrecord
 from executorch.util.activation_memory_profiler import generate_memory_trace
