
Commit 55cc430

digantdesai authored and facebook-github-bot committed
Always use two XNNPACK Partitioners (#5573)
Summary:
Pull Request resolved: #5573

This changes the default behavior: the llama export now always applies two XNNPACK partitioners. It helps prefill by ~20% and hurts decode by ~7%. As a next step, I will dig further into the decode perf regression and see whether we can get anything more on prefill by tuning the XNNPACK thread dispatcher for gemm, gemv, mul, add, sigmoid, and sub.

**On my local (unreliable) S23:**

* Vanilla:

```
dm1q:/data/local/tmp/llama $ ./llama_main_release \
  --model_path ./llama_gs32_vanilla.pte \
  --tokenizer_path ./tokenizer.bin \
  --seq_len=128 \
  --prompt="${prompt}"
[...]
I 00:00:22.188618 executorch:stats.h:84] Prompt Tokens: 44 Generated Tokens: 83
I 00:00:22.188621 executorch:stats.h:90] Model Load Time: 12.922000 (seconds)
I 00:00:22.188624 executorch:stats.h:100] Total inference time: 9.252000 (seconds) Rate: 8.971033 (tokens/second)
I 00:00:22.188627 executorch:stats.h:108] Prompt evaluation: 1.740000 (seconds) Rate: 25.287356 (tokens/second)
I 00:00:22.188630 executorch:stats.h:119] Generated 83 tokens: 7.512000 (seconds) Rate: 11.048988 (tokens/second)
I 00:00:22.188632 executorch:stats.h:127] Time to first generated token: 1.740000 (seconds)
I 00:00:22.188634 executorch:stats.h:134] Sampling time over 127 tokens: 0.015000 (seconds)
[...]
```

* Two partitioners (2part):

```
dm1q:/data/local/tmp/llama $ ./llama_main_release \
  --model_path ./llama_gs32_2part.pte \  # New PTE
  --tokenizer_path ./tokenizer.bin \
  --seq_len=128 \
  --prompt="${prompt}"
[...]
I 00:00:22.205058 executorch:stats.h:84] Prompt Tokens: 44 Generated Tokens: 83
I 00:00:22.205061 executorch:stats.h:90] Model Load Time: 12.876000 (seconds)
I 00:00:22.205063 executorch:stats.h:100] Total inference time: 9.323000 (seconds) Rate: 8.902714 (tokens/second)
I 00:00:22.205067 executorch:stats.h:108] Prompt evaluation: 1.549000 (seconds) Rate: 28.405423 (tokens/second)
I 00:00:22.205070 executorch:stats.h:119] Generated 83 tokens: 7.774000 (seconds) Rate: 10.676614 (tokens/second)
I 00:00:22.205073 executorch:stats.h:127] Time to first generated token: 1.549000 (seconds)
I 00:00:22.205075 executorch:stats.h:134] Sampling time over 127 tokens: 0.029000 (seconds)
[...]
```

**Similar results on AiBench (OnePlus 12):**

* Vanilla, AiBench links: [gs=32](https://www.internalfb.com/intern/aibench/details/114258284562772), [gs=256](https://www.internalfb.com/intern/aibench/details/438103192423336)

```
# gs=32
I 00:00:21.792659 executorch:stats.h:84] Prompt Tokens: 5 Generated Tokens: 118
I 00:00:21.792721 executorch:stats.h:90] Model Load Time: 11.666000 (seconds)
I 00:00:21.792754 executorch:stats.h:100] Total inference time: 10.109000 (seconds) Rate: 11.672767 (tokens/second)
I 00:00:21.792778 executorch:stats.h:108] Prompt evaluation: 0.365000 (seconds) Rate: 13.698630 (tokens/second)
I 00:00:21.792799 executorch:stats.h:119] Generated 118 tokens: 9.744000 (seconds) Rate: 12.110016 (tokens/second)
I 00:00:21.792818 executorch:stats.h:127] Time to first generated token: 0.365000 (seconds)
I 00:00:21.792837 executorch:stats.h:134] Sampling time over 123 tokens: 0.008000 (seconds)
```

* Two partitioners, AiBench links: [gs=32](https://www.internalfb.com/intern/aibench/details/852029802754424), [gs=256](https://www.internalfb.com/intern/aibench/details/491722732991273)

```
# gs=32
I 00:00:22.584271 executorch:stats.h:84] Prompt Tokens: 5 Generated Tokens: 118
I 00:00:22.584336 executorch:stats.h:90] Model Load Time: 11.610000 (seconds)
I 00:00:22.584367 executorch:stats.h:100] Total inference time: 10.960000 (seconds) Rate: 10.766423 (tokens/second)
I 00:00:22.584389 executorch:stats.h:108] Prompt evaluation: 0.286000 (seconds) Rate: 17.482517 (tokens/second)
I 00:00:22.584409 executorch:stats.h:119] Generated 118 tokens: 10.674000 (seconds) Rate: 11.054900 (tokens/second)
I 00:00:22.584428 executorch:stats.h:127] Time to first generated token: 0.286000 (seconds)
I 00:00:22.584446 executorch:stats.h:134] Sampling time over 123 tokens: 0.013000 (seconds)
```

bypass-github-export-checks
bypass-github-pytorch-ci-checks
bypass-github-executorch-ci-checks

Reviewed By: mcr229

Differential Revision: D63271101

fbshipit-source-id: 1680c8d97ec5ac15011eae8a6d75a003bd0354b4
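For context, here is a minimal sketch of the two-partitioner composition this commit makes the default. The import path, and the assumption that list order is application order, are inferred from the diffs below rather than guaranteed API:

```python
# Sketch of the new default: two XNNPACK partitioners, applied in order.
# Assumptions: partitioner_lib is importable at this path, and lowering
# applies the partitioners in list order (as the diff's comment implies).
from executorch.extension.llm.export.partitioner_lib import get_xnnpack_partitioner

partitioners = [
    # Pass 1: claim dynamically quantized linear (DQLinear) ops first,
    # matching the previous single-partitioner behavior.
    get_xnnpack_partitioner(dynamic_quant_only_partitioner=True),
    # Pass 2: greedily claim the remaining supported ops
    # (gemm, gemv, mul, add, sigmoid, sub, ...).
    get_xnnpack_partitioner(dynamic_quant_only_partitioner=False),
]
```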
1 parent bdaad8e commit 55cc430

3 files changed: +50 −12 lines


.ci/scripts/test_llama.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -188,7 +188,7 @@ EXPORTED_MODEL_NAME="${EXPORTED_MODEL_NAME}.pte"
 echo "Exporting ${EXPORTED_MODEL_NAME}"
 EXPORT_ARGS="-c ${CHECKPOINT_FILE_NAME} -p ${PARAMS} -d ${DTYPE} -n ${EXPORTED_MODEL_NAME} -kv"
 if [[ "${XNNPACK}" == "ON" ]]; then
-  EXPORT_ARGS="${EXPORT_ARGS} -X -qmode 8da4w -G 128"
+  EXPORT_ARGS="${EXPORT_ARGS} -X --xnnpack-extended-ops -qmode 8da4w -G 128"
 fi
 if [[ "${CUSTOM}" == "ON" ]]; then
   EXPORT_ARGS="${EXPORT_ARGS} --use_sdpa_with_kv_cache"
```
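As a sanity check on the flag plumbing, here is a small, hedged argparse sketch (a minimal stand-in, not the real parser in export_llama_lib.py) showing how the `-X --xnnpack-extended-ops` arguments added above parse:

```python
# Minimal stand-in for the exporter's argument parser, showing how the flags
# added to EXPORT_ARGS above surface in Python. Not the actual parser.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-X", "--xnnpack", action="store_true")
parser.add_argument("--xnnpack-extended-ops", action="store_true")

args = parser.parse_args(["-X", "--xnnpack-extended-ops"])
# argparse turns "--xnnpack-extended-ops" into the attribute xnnpack_extended_ops.
assert args.xnnpack and args.xnnpack_extended_ops
```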

examples/models/llama2/export_llama_lib.py

Lines changed: 31 additions & 5 deletions
```diff
@@ -297,7 +297,17 @@ def build_args_parser() -> argparse.ArgumentParser:

     parser.add_argument("-2", "--fairseq2", action="store_true")
     parser.add_argument("-v", "--verbose", action="store_true")
-    parser.add_argument("-X", "--xnnpack", action="store_true")
+    parser.add_argument(
+        "-X",
+        "--xnnpack",
+        action="store_true",
+        help="Delegate DQLinear ops to the XNNPACK backend",
+    )
+    parser.add_argument(
+        "--xnnpack-extended-ops",
+        action="store_true",
+        help="Delegate more operators beyond DQLinear to the XNNPACK backend. Requires -X or --xnnpack to be set.",
+    )
     parser.add_argument("-V", "--vulkan", action="store_true")
     parser.add_argument("--mps", action="store_true")
     parser.add_argument("--coreml", action="store_true")
```

```diff
@@ -546,12 +556,24 @@ def _export_llama(modelname, args) -> LLMEdgeManager:  # noqa: C901

     # to_backend
     partitioners = []
-    if pt2e_quant_params is not None and pt2e_quant_params.quantize_linear is not None:
-        partitioners.append(get_xnnpack_partitioner())
+
+    # Order matters here: when both xnnpack and xnnpack_extended_ops are
+    # enabled, the dynamic-quantization partitioner must be applied first.
+    if (
+        pt2e_quant_params is not None and pt2e_quant_params.quantize_linear is not None
+    ) or (args.xnnpack):
+        partitioners.append(
+            get_xnnpack_partitioner(dynamic_quant_only_partitioner=True)
+        )
+
+        # Force xnnpack to True when pt2e_quant_params is set but args.xnnpack is False.
+        args.xnnpack = True
         modelname = f"xnnpack_dq_{modelname}"

-    if args.xnnpack:
-        partitioners.append(get_xnnpack_partitioner())
+    if args.xnnpack_extended_ops:
+        assert args.xnnpack, "xnnpack_extended_ops requires xnnpack to be enabled"
+        partitioners.append(
+            get_xnnpack_partitioner(dynamic_quant_only_partitioner=False)
+        )
         modelname = f"xnnpack_{modelname}"

     if args.vulkan:
```

```diff
@@ -598,6 +620,10 @@ def _export_llama(modelname, args) -> LLMEdgeManager:  # noqa: C901
             shares=args.num_sharding,
         )

+    logging.info("Lowering model using the following partitioner(s):")
+    for partitioner in partitioners:
+        logging.info(f"--> {partitioner.__class__.__name__}")
+
     if args.generate_etrecord:
         if not builder_exported_to_edge.edge_manager:
             raise ValueError("Unable to generate etrecord due to missing edge manager.")
```
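To reason about the branches above in isolation, here is a hedged, self-contained restatement of the flag-to-partitioner mapping; `select_xnnpack_modes` is a hypothetical helper written for illustration, not part of the repo:

```python
# Hypothetical restatement of the partitioner-selection logic above, useful
# for checking flag combinations in isolation. Each returned bool is the
# dynamic_quant_only_partitioner argument, in application order.
from typing import List


def select_xnnpack_modes(
    xnnpack: bool, xnnpack_extended_ops: bool, pt2e_quantize_linear: bool
) -> List[bool]:
    modes: List[bool] = []
    if pt2e_quantize_linear or xnnpack:
        modes.append(True)  # DQLinear partitioner always runs first
        xnnpack = True  # pt2e linear quantization implies -X
    if xnnpack_extended_ops:
        assert xnnpack, "xnnpack_extended_ops requires xnnpack to be enabled"
        modes.append(False)  # greedy partitioner runs second
    return modes


assert select_xnnpack_modes(True, True, False) == [True, False]
assert select_xnnpack_modes(True, False, False) == [True]
assert select_xnnpack_modes(False, False, True) == [True]
```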

extension/llm/export/partitioner_lib.py

Lines changed: 18 additions & 6 deletions
```diff
@@ -7,16 +7,28 @@
 from typing import Optional


-def get_xnnpack_partitioner():
+def get_xnnpack_partitioner(dynamic_quant_only_partitioner: bool = True):
+    """
+    Returns the XNNPACK partitioner.
+
+    @arg dynamic_quant_only_partitioner:
+        This is enabled by default to keep backward compatibility.
+        If True, only dynamically quantized linear layers will be partitioned.
+        Otherwise, anything that can be partitioned will be partitioned greedily.
+    """
     from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
         XnnpackDynamicallyQuantizedPartitioner,
+        XnnpackPartitioner,
     )

-    # Following changes due to.
-    # 1. We need dynamically quantized partitioner for both pt2e_quantize options
-    #    as well as "qmode 8da4w" which is also dynamic quantizes linear layers.
-    # 2. XNNPACK partitioner seems to result in seg fault for non dqlinear ops.
-    return XnnpackDynamicallyQuantizedPartitioner()
+    if dynamic_quant_only_partitioner:
+        # We keep a dedicated dynamically quantized partitioner because:
+        # 1. We need it for both pt2e_quantize options as well as
+        #    "qmode 8da4w", which also dynamically quantizes linear layers.
+        # 2. The general XNNPACK partitioner seems to segfault on
+        #    non-dqlinear ops.
+        return XnnpackDynamicallyQuantizedPartitioner()
+    return XnnpackPartitioner()


 def get_vulkan_partitioner(
```
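A brief usage sketch of the updated helper (import path assumed from the repo layout); the resulting two-element list mirrors what `_export_llama` builds when both flags are set:

```python
from executorch.extension.llm.export.partitioner_lib import get_xnnpack_partitioner

# Default keeps backward compatibility: DQLinear-only partitioning.
dq_only = get_xnnpack_partitioner()

# Opting out of dynamic-quant-only mode returns the greedy XnnpackPartitioner.
greedy = get_xnnpack_partitioner(dynamic_quant_only_partitioner=False)

partitioners = [dq_only, greedy]  # order matters: DQLinear claims its ops first
```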
