
Commit d91afc6

Update on "[Executorch][llama] Make RoPE freq calculation broadcast for per head"
This is a workaround, and may not even be worth landing, to avoid broadcasting semantics in the mul op (and, for that matter, in any binary op). The current implementation of the optimized ops does not handle broadcasting and falls back to the portable op implementation.

This diff also fixes an issue where (as seen in llama) the two tensors of a binary op do not require broadcasting but have different numbers of dims, which results in invocation of the unoptimized path, e.g. a = [1, 1, 2048], b = [2048], out = [1, 1, 2048]. In llama this case takes the optimized path when generating one token at a time, but not during pre-fill.

Making the optimized op handle broadcasting, and support vectorization, is not hard, but may take some time.

Differential Revision: [D54766067](https://our.internmc.facebook.com/intern/diff/D54766067/)

[ghstack-poisoned]
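To make the dim-mismatch case concrete, here is a minimal sketch in Python (not the ExecuTorch kernel code; the helper names and the pre-fill sequence length are hypothetical) of the distinction described above: shapes like [1, 1, 2048] and [2048] are element-wise compatible once leading singleton dims are ignored, so no data actually needs to be repeated.

```python
# Minimal sketch only; not the ExecuTorch implementation.
# Helper names below are hypothetical.

def strip_leading_ones(shape):
    """Drop leading singleton dimensions, e.g. (1, 1, 2048) -> (2048,)."""
    dims = list(shape)
    while len(dims) > 1 and dims[0] == 1:
        dims.pop(0)
    return tuple(dims)


def needs_real_broadcast(shape_a, shape_b):
    """True only when one operand's data must actually be repeated;
    a pure leading-dim mismatch can still use the plain element-wise path."""
    return strip_leading_ones(shape_a) != strip_leading_ones(shape_b)


# Decode (one token at a time): same element count, nothing to repeat,
# so an optimized element-wise kernel could handle it directly.
assert not needs_real_broadcast((1, 1, 2048), (2048,))

# Pre-fill-like case (hypothetical seq_len = 8): the smaller operand must be
# repeated per row, which requires genuine broadcasting support.
assert needs_real_broadcast((1, 8, 2048), (2048,))
```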
2 parents: c7d9d3b + bd22d18

File tree

1 file changed (+1, -1 lines)


examples/models/llama2/export_llama_lib.py

Lines changed: 1 addition & 1 deletion
@@ -22,9 +22,9 @@
 from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
     XnnpackDynamicallyQuantizedPartitioner,
 )
-from executorch.exir.backend.backend_details import CompileSpec
 
 from executorch.examples.models.llama2.llama_transformer import Transformer
+from executorch.exir.backend.backend_details import CompileSpec
 
 from executorch.sdk.etrecord import generate_etrecord
 from executorch.util.activation_memory_profiler import generate_memory_trace
