@@ -202,7 +202,7 @@ tokenizer.py utility to convert the tokenizer.model to tokenizer.bin
format:

```
- python utils/tokenizer.py --tokenizer-model=${MODEL_DIR}tokenizer.model
+ python3 utils/tokenizer.py --tokenizer-model=${MODEL_DIR}tokenizer.model
```

We will later discuss how to use this model, as described under *STANDALONE EXECUTION* in a Python-free
@@ -226,7 +226,7 @@ At present, we always use the torchchat model for export and import the checkpoi
because we have tested that model with the export procedures described herein.

```
- python generate.py --compile --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --device [ cuda | cpu | mps]
+ python3 generate.py --compile --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --device [ cuda | cpu | mps]
```

To squeeze out a little bit more performance, you can also compile the
@@ -240,12 +240,12 @@ though.
Let's start by exporting and running a small model like stories15M.
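The commands in this walkthrough rely on shell variables such as ${MODEL_NAME}, ${MODEL_DIR}, ${MODEL_PATH}, and ${MODEL_OUT} that are set up earlier in this README. As a minimal sketch for the stories15M example (the exact paths are assumptions; substitute your own locations):

```
# Hypothetical layout; adjust to wherever you downloaded the checkpoint
# and where you want exported artifacts to land.
MODEL_NAME=stories15M
MODEL_DIR=checkpoints/${MODEL_NAME}/
MODEL_PATH=${MODEL_DIR}stories15M.pt
MODEL_OUT=./exports
mkdir -p ${MODEL_OUT}
```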

```
- python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --output-pte-path ${MODEL_OUT}/model.pte
+ python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --output-pte-path ${MODEL_OUT}/model.pte
```

### AOT Inductor compilation and execution
```
- python export.py --checkpoint-path ${MODEL_PATH} --device {cuda,cpu} --output-dso-path ${MODEL_OUT}/${MODEL_NAME}.so
+ python3 export.py --checkpoint-path ${MODEL_PATH} --device {cuda,cpu} --output-dso-path ${MODEL_OUT}/${MODEL_NAME}.so
```

When you have exported the model, you can test the model with the
@@ -256,7 +256,7 @@ exported model with the same interface, and support additional
experiments to confirm model quality and speed.

```
- python generate.py --device {cuda,cpu} --dso-path ${MODEL_OUT}/${MODEL_NAME}.so --prompt "Hello my name is"
+ python3 generate.py --device {cuda,cpu} --dso-path ${MODEL_OUT}/${MODEL_NAME}.so --prompt "Hello my name is"
```

While we have shown the export and execution of a small model on CPU
@@ -278,7 +278,7 @@ delegates such as Vulkan, CoreML, MPS, HTP in addition to Xnnpack as they are re
With the model exported, you can now generate text with the executorch runtime pybindings. Feel free to play around with the prompt.

```
- python generate.py --checkpoint-path ${MODEL_PATH} --pte ${MODEL_OUT}/model.pte --device cpu --prompt "Once upon a time"
+ python3 generate.py --checkpoint-path ${MODEL_PATH} --pte ${MODEL_OUT}/model.pte --device cpu --prompt "Once upon a time"
```

You can also run the model with the runner-et. See below under "Standalone Execution".
@@ -322,8 +322,8 @@ linear operator (asymmetric) with HQQ | n/a | work in progress | n/a |
For both export and generate (with eager, torch.compile, AOTI, and ET, for all backends and all options; at present, mobile will primarily support fp32),
you can specify the precision of the model with
```
- python generate.py --dtype [bf16 | fp16 | fp32] ...
- python export.py --dtype [bf16 | fp16 | fp32] ...
+ python3 generate.py --dtype [bf16 | fp16 | fp32] ...
+ python3 export.py --dtype [bf16 | fp16 | fp32] ...
```

Unlike gpt-fast, which uses bfloat16 as the default, torchchat uses float32 as the default. As a consequence, you will have to pass `--dtype bf16` or `--dtype fp16` on server / desktop for best performance.
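For example, on a CUDA-capable server, one plausible invocation combining the flags shown above would be (illustrative only):

```
# Illustrative combination of documented flags; assumes a CUDA device is available.
python3 generate.py --dtype bf16 --compile --checkpoint-path ${MODEL_PATH} --device cuda --prompt "Hello, my name is"
```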
@@ -366,35 +366,35 @@ We can do this in eager mode (optionally with torch.compile), we use the `embedd
groupsize set to 0, which uses channelwise quantization:

```
- python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
+ python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
```

Then, export as follows:
```
- python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
+ python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
```

Now you can run your model with the same command as before:
```
- python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte --prompt "Hello my name is"
+ python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte --prompt "Hello my name is"
```

*Groupwise quantization*:

We can do this in eager mode (optionally with `torch.compile`), using the `embedding` quantizer and specifying the group size:

```
- python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 8, "groupsize": 8}}' --device cpu
+ python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 8, "groupsize": 8}}' --device cpu
```

Then, export as follows:
```
- python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 8, "groupsize": 8} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
+ python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 8, "groupsize": 8} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
```

Now you can run your model with the same command as before:
```
- python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte --prompt "Hello my name is"
+ python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte --prompt "Hello my name is"
```

#### Embedding quantization (4 bit integer, channelwise & groupwise)
@@ -410,35 +410,35 @@ We can do this in eager mode (optionally with torch.compile), we use the `embedd
groupsize set to 0, which uses channelwise quantization:

```
- python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 4, "groupsize": 0}}' --device cpu
+ python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 4, "groupsize": 0}}' --device cpu
```

Then, export as follows:
```
- python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 4, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
+ python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 4, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
```

Now you can run your model with the same command as before:
```
- python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte --prompt "Hello my name is"
+ python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte --prompt "Hello my name is"
```

*Groupwise quantization*:

We can do this in eager mode (optionally with `torch.compile`), using the `embedding` quantizer and specifying the group size:

```
- python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 4, "groupsize": 8}}' --device cpu
+ python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 4, "groupsize": 8}}' --device cpu
```

Then, export as follows:
```
- python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 4, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
+ python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 4, "groupsize": 8} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
```

Now you can run your model with the same command as before:
```
- python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte --prompt "Hello my name is"
+ python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte --prompt "Hello my name is"
```

#### Linear 8 bit integer quantization (channel-wise and groupwise)
@@ -455,55 +455,55 @@ We can do this in eager mode (optionally with torch.compile), we use the `linear
groupsize set to 0, which uses channelwise quantization:

```
- python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int8" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
+ python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int8" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
```

Then, export as follows using ExecuTorch for mobile backends:
```
- python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte
+ python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte
```

Now you can run your model with the same command as before:
```
- python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte --checkpoint-path ${MODEL_PATH} --prompt "Hello my name is"
+ python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte --checkpoint-path ${MODEL_PATH} --prompt "Hello my name is"
```

Or, export as follows for server/desktop deployments:
```
- python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.so
+ python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int8.so
```

Now you can run your model with the same command as before:
```
- python generate.py --dso-path ${MODEL_OUT}/${MODEL_NAME}_int8.so --checkpoint-path ${MODEL_PATH} --prompt "Hello my name is"
+ python3 generate.py --dso-path ${MODEL_OUT}/${MODEL_NAME}_int8.so --checkpoint-path ${MODEL_PATH} --prompt "Hello my name is"
```

*Groupwise quantization*:

We can do this in eager mode (optionally with `torch.compile`), using the `linear:int8` quantizer and specifying the group size:

```
- python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int8" : {"bitwidth": 8, "groupsize": 8}}' --device cpu
+ python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int8" : {"bitwidth": 8, "groupsize": 8}}' --device cpu
```

Then, export as follows using ExecuTorch:
```
- python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.pte
+ python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 8} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.pte
```

Now you can run your model with the same command as before:
```
- python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.pte --checkpoint-path ${MODEL_PATH} --prompt "Hello my name is"
+ python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.pte --checkpoint-path ${MODEL_PATH} --prompt "Hello my name is"
```

Or, export as follows for server/desktop deployments:
```
- python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.so
+ python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 8} }' --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.so
```

Now you can run your model with the same command as before:
```
- python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.so --checkpoint-path ${MODEL_PATH} -d fp32 --prompt "Hello my name is"
+ python3 generate.py --dso-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.so --checkpoint-path ${MODEL_PATH} -d fp32 --prompt "Hello my name is"
```
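Since `--quant` takes a JSON object, the embedding and linear schemes above can presumably be combined into a single configuration; this combination is an assumption based on the option format shown here, so verify it against the torchchat quantization docs:

```
# Assumed combined config (8-bit channelwise embedding + channelwise linear:int8); output name is hypothetical.
python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 8, "groupsize": 0}, "linear:int8": {"bitwidth": 8, "groupsize": 0}}' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8_int8.pte
```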

Please note that group-wise quantization works functionally, but has
@@ -515,36 +515,36 @@ operator.
To compress your model even more, 4-bit integer quantization may be used. To achieve good accuracy, we recommend the use
of groupwise quantization where (small to mid-sized) groups of int4 weights share a scale.
```
- python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:int4': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso]
+ python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:int4': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso]
```

Now you can run your model with the same command as before:
```
- python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso] --prompt "Hello my name is"
+ python3 generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso] --prompt "Hello my name is"
```

#### 4-bit integer quantization (8da4w)
To compress your model even more, 4-bit integer quantization may be used. To achieve good accuracy, we recommend the use
of groupwise quantization where (small to mid-sized) groups of int4 weights share a scale. We also quantize activations to 8-bit, giving
this scheme its name (8da4w = 8b dynamically quantized activations with 4b weights), and boosting performance.
```
- python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:8da4w': {'groupsize' : 7} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso... ]
+ python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:8da4w': {'groupsize' : 7} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso... ]
```

Now you can run your model with the same command as before:
```
- python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso...] --prompt "Hello my name is"
+ python3 generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso...] --prompt "Hello my name is"
```

#### Quantization with GPTQ (gptq)

```
- python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:gptq': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso... ] # may require additional options, check with AO team
+ python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:gptq': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso... ] # may require additional options, check with AO team
```

Now you can run your model with the same command as before:
```
- python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso...] --prompt "Hello my name is"
+ python3 generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso...] --prompt "Hello my name is"
```

#### Adding additional quantization schemes (hqq)