Commit eb6f954

[Executorch][perf-ci] Fix perf CI

Summary: The previous PR #7927 decoupled max_seq_length from the KV cache size, which broke the perf CI workflow. Fix it by passing --max_context_length explicitly, alongside --max_seq_length, in the affected workflows and docs.

Test Plan: Trigger the workflow manually and verify it passes.

Reviewers:
Subscribers:
Tasks:
Tags:

ghstack-source-id: cd637af
Pull Request resolved: #8374
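The fix follows the same pattern in every touched file: wherever a model is exported with --max_seq_length, also pass --max_context_length so the KV cache is sized explicitly. A minimal sketch of the pattern (checkpoint and params paths are placeholders, and the 2048-token value simply mirrors the CI settings below; the real invocations are in the diffs that follow):

```
# Sketch only: paths are placeholders.
# After #7927, --max_seq_length no longer determines the KV cache size,
# so --max_context_length must be supplied explicitly.
python -m examples.models.llama.export_llama \
  --checkpoint <path-to-checkpoint.pth> \
  --params <path-to-params.json> \
  -kv \
  --max_seq_length 2048 \
  --max_context_length 2048 \
  --output_name "model.pte"
```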
1 parent 78752a0 commit eb6f954

5 files changed (+13, -5 lines)


.github/workflows/android-perf.yml

Lines changed: 2 additions & 0 deletions
@@ -222,6 +222,7 @@ jobs:
 --preq_mode 8da4w_output_8da8w \
 --preq_group_size 32 \
 --max_seq_length 2048 \
+--max_context_length 2048 \
 --output_name "${OUT_ET_MODEL_NAME}.pte" \
 -kv \
 -d fp32 \
@@ -253,6 +254,7 @@ jobs:
 --xnnpack-extended-ops \
 -d fp32 \
 --max_seq_length 2048 \
+--max_context_length 2048 \
 --output_name "${OUT_ET_MODEL_NAME}.pte" \
 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
 ls -lh "${OUT_ET_MODEL_NAME}.pte"

.github/workflows/apple-perf.yml

Lines changed: 2 additions & 0 deletions
@@ -233,6 +233,7 @@ jobs:
 --preq_mode 8da4w_output_8da8w \
 --preq_group_size 32 \
 --max_seq_length 2048 \
+--max_context_length 2048 \
 --output_name "${OUT_ET_MODEL_NAME}.pte" \
 -kv \
 -d fp32 \
@@ -264,6 +265,7 @@ jobs:
 --xnnpack-extended-ops \
 -d fp32 \
 --max_seq_length 2048 \
+--max_context_length 2048 \
 --output_name "${OUT_ET_MODEL_NAME}.pte" \
 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
 ls -lh "${OUT_ET_MODEL_NAME}.pte"

examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md

Lines changed: 3 additions & 3 deletions
@@ -56,14 +56,14 @@ In this demo app, we support text-only inference with up-to-date Llama models and
 Meta has released prequantized INT4 SpinQuant Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
 * Export Llama model and generate .pte file as below:
 ```
-python -m examples.models.llama.export_llama --model "llama3_2" --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --use_spin_quant native --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"
+python -m examples.models.llama.export_llama --model "llama3_2" --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --max_context_length 2048 --preq_embedding_quantize 8,0 --use_spin_quant native --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"
 ```

 ### For Llama 3.2 1B and 3B QAT+LoRA models
 Meta has released prequantized INT4 QAT+LoRA Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
 * Export Llama model and generate .pte file as below:
 ```
-python -m examples.models.llama.export_llama --model "llama3_2" --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -qat -lora 16 -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_qat_lora.pte"
+python -m examples.models.llama.export_llama --model "llama3_2" --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -qat -lora 16 -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --max_context_length 2048 --preq_embedding_quantize 8,0 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_qat_lora.pte"
 ```

 ### For Llama 3.2 1B and 3B BF16 models
@@ -87,7 +87,7 @@ To safeguard your application, you can use our Llama Guard models for prompt classification
 * We prepared this model using the following command

 ```
-python -m examples.models.llama.export_llama --checkpoint <path-to-pruned-llama-guard-1b-checkpoint.pth> --params <path-to-your-params.json> -d fp32 -kv --use_sdpa_with_kv_cache --quantization_mode 8da4w --group_size 256 --xnnpack --max_seq_length 8193 --embedding-quantize 4,32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_prune_map <path-to-your-llama_guard-pruned-layers-map.json> --output_name="llama_guard_3_1b_pruned_xnnpack.pte"
+python -m examples.models.llama.export_llama --checkpoint <path-to-pruned-llama-guard-1b-checkpoint.pth> --params <path-to-your-params.json> -d fp32 -kv --use_sdpa_with_kv_cache --quantization_mode 8da4w --group_size 256 --xnnpack --max_seq_length 8193 --max_context_length 8193 --embedding-quantize 4,32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_prune_map <path-to-your-llama_guard-pruned-layers-map.json> --output_name="llama_guard_3_1b_pruned_xnnpack.pte"
 ```

examples/demo-apps/apple_ios/LLaMA/docs/delegates/xnnpack_README.md

Lines changed: 2 additions & 2 deletions
@@ -51,14 +51,14 @@ sh examples/models/llama/install_requirements.sh
 Meta has released prequantized INT4 SpinQuant Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
 * Export Llama model and generate .pte file as below:
 ```
-python -m examples.models.llama.export_llama --model "llama3_2" --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --use_spin_quant native --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"
+python -m examples.models.llama.export_llama --model "llama3_2" --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --max_context_length 2048 --preq_embedding_quantize 8,0 --use_spin_quant native --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"
 ```

 ### For Llama 3.2 1B and 3B QAT+LoRA models
 Meta has released prequantized INT4 QAT+LoRA Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
 * Export Llama model and generate .pte file as below:
 ```
-python -m examples.models.llama.export_llama --model "llama3_2" --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -qat -lora 16 -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_qat_lora.pte"
+python -m examples.models.llama.export_llama --model "llama3_2" --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -qat -lora 16 -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --max_context_length 2048 --preq_embedding_quantize 8,0 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_qat_lora.pte"
 ```

 ### For Llama 3.2 1B and 3B BF16 models

examples/models/llama/README.md

Lines changed: 4 additions & 0 deletions
@@ -199,6 +199,7 @@ python -m examples.models.llama.export_llama \
 --preq_mode 8da4w_output_8da8w \
 --preq_group_size 32 \
 --max_seq_length 2048 \
+--max_context_length 2048 \
 --output_name "llama3_2.pte" \
 -kv \
 -d fp32 \
@@ -230,6 +231,7 @@ python -m examples.models.llama.export_llama \
 --xnnpack-extended-ops \
 -d fp32 \
 --max_seq_length 2048 \
+--max_context_length 2048 \
 --output_name "llama3_2.pte" \
 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
 ```
@@ -397,6 +399,7 @@ python -m examples.models.llama.eval_llama \
 -kv \
 -d <checkpoint dtype> \
 --max_seq_len <max sequence length> \
+--max_context_len <max context length> \
 --limit <number of samples>
 ```
@@ -411,6 +414,7 @@ python -m examples.models.llama.eval_llama \
 --tasks mmlu \
 --num_fewshot 5 \
---max_seq_len <max sequence length>
+--max_seq_len <max sequence length> \
+--max_context_len <max context length>
 ```

 See the [Llama utils page](./UTILS.md) for more advanced use-cases such as fine-tuning, running smaller models for educational purposes, and quick iteration and verification.
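For reference, the evaluation command assembled from the two eval_llama hunks above would look roughly like this (a sketch: fp32, 2048, and the sample limit are placeholder values, and the usual checkpoint/params arguments elided from the diff context still apply):

```
# Placeholder values; flags are those visible in the README hunks above.
python -m examples.models.llama.eval_llama \
  -kv \
  -d fp32 \
  --max_seq_len 2048 \
  --max_context_len 2048 \
  --limit 100
```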
