Quantize llama3 on et export #436


Merged
merged 2 commits into main on Apr 26, 2024

Conversation

@lucylq (Contributor) commented Apr 23, 2024

  • Recommend quantizing llama models. Update examples to use llama3 rather than stories.
  • Update quantization docs: weight, embedding, and examples.
  • Some quantization schemes (gptq, linear int4, hqq) do not work for et export and are not intended to; specify which schemes are supported.

Test with:

python3 torchchat.py export llama3 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}' --output-pte-path llama3_q.pte       

Run:

python3 torchchat.py generate llama3 --device cpu --pte-path llama3_8da_qe4_32.pte --prompt "Hello my name is" --tokenizer-path ../llama-models/llama3/tiktokenizer.bin

<|begin_of_text|>Hello my name is Samantha, I am a trained healthcare professional with a passion for delivering exceptional patient care. I am excited to offer my services as a caregiver through this platform. I have extensive experience in providing assistance with daily living activities, such as bathing, dressing, grooming, meal preparation, medication management, and other essential tasks. I am also trained in first aid and CPR. I am available for short-term or long-term positions and can work flexible hours depending on your needs. I am a non-smoker and a non-drinker, and I have a clean driving record. I look forward to working with you and your loved ones.
Hello my name is Samantha, I am a trained healthcare professional with a passion for delivering exceptional patient care. I am excited to offer my services as a caregiver through this platform. I have extensive experience in providing assistance with daily living activities, such as bathing, dressing, grooming, meal preparation, medication management, and other essential tasks. I am also trained in first aid and CPR.
Time for inference 1: 19.38 sec total, 10.32 tokens/sec
Bandwidth achieved: 0.00 GB/s
==========
Average tokens/sec: 10.32
Memory used: 0.00 GB

Tested with the different export quantization schemes in https://github.com/pytorch/torchchat/blob/main/docs/quantization.md.
Not tested for AOTI.
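
For comparison, an embedding-only scheme can be exported the same way. This is a sketch, not one of the commands tested in this PR, and the output filename is illustrative:

python3 torchchat.py export llama3 --quantize '{"embedding": {"bitwidth": 4, "groupsize": 32}}' --output-pte-path llama3_emb4_32.pte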

@facebook-github-bot added the CLA Signed label Apr 23, 2024
@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch 2 times, most recently from 0611004 to ed95bd1 Compare April 24, 2024 16:39
@lucylq lucylq marked this pull request as ready for review April 24, 2024 16:40
@lucylq lucylq requested a review from kimishpatel April 24, 2024 16:40
@kimishpatel (Contributor)

In the summary, can you put the repro steps or how you tested it?

@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch 4 times, most recently from 93db116 to 93f7847 Compare April 24, 2024 17:14
@iseeyuan (Contributor)

Could you also update examples/models/llama2/README, where for llama3 models, --embedding-quantize 4,32 is suggested to reduce the model size significantly due to the larger vocabulary size.

@lucylq (Contributor, Author) commented Apr 24, 2024

> Could you also update examples/models/llama2/README, where for llama3 models, --embedding-quantize 4,32 is suggested to reduce the model size significantly due to the larger vocabulary size.

Thanks Martin, added in pytorch/executorch#3315
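
For context, the suggested flag would be combined with the existing ExecuTorch llama export options, roughly as sketched below. Apart from --embedding-quantize 4,32, the flags and placeholder paths are assumptions based on the ExecuTorch llama2 example, not commands from this PR:

# sketch only: checkpoint/params paths are placeholders; verify flags against the ExecuTorch llama2 README
python -m examples.models.llama2.export_llama -c <checkpoint.pth> -p <params.json> -kv -X -qmode 8da4w --group_size 128 --embedding-quantize 4,32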

@kimishpatel (Contributor)

> python3 torchchat.py export llama3 --quantize '{"linear:gptq": {"groupsize" : 32} }' --output-pte-path llama3_q.pte

Did you try running the generated pte?

@lucylq (Contributor, Author) commented Apr 24, 2024

> python3 torchchat.py export llama3 --quantize '{"linear:gptq": {"groupsize" : 32} }' --output-pte-path llama3_q.pte
>
> Did you try running the generated pte?

So, the gptq, linear int4, and hqq paths don't export well through executorch. I'll try running the 8da4w+qe file; updated in the summary.

@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch 4 times, most recently from 65ce384 to 224b258 Compare April 25, 2024 00:13
@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch 4 times, most recently from cc0e202 to 5bf7895 Compare April 25, 2024 00:43
@kimishpatel (Contributor)

@mikekgfb I asked Lucy to make these changes. Quite a bit of material toward the latter half of the doc is removed. I think not having a very long doc is better, since those sections were just showcasing example commands for invoking each quant scheme. If you differ in opinion, let me know.

Also, is HQQ done or WIP?

@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch 2 times, most recently from f99752f to 3f81ec4 Compare April 25, 2024 16:27
@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch from 3f81ec4 to 2af547e Compare April 25, 2024 18:45
@lucylq lucylq requested a review from jerryzh168 April 25, 2024 18:47
@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch from 2af547e to b65f005 Compare April 25, 2024 18:54
@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch 2 times, most recently from 86a107d to 781be13 Compare April 25, 2024 19:04
Depending on the model and the target device, different quantization recipes may be applied. Torchchat contains two example configurations to optimize performance: `config/data/cuda.json` for GPU-based systems and `config/data/mobile.json` for mobile systems. The GPU configuration is targeted towards optimizing for memory bandwidth, which is a scarce resource even on powerful GPUs (and, to a lesser degree, memory footprint to fit large models into a device's memory). The mobile configuration is targeted towards optimizing for memory footprint, because on many devices a single application is limited to only a few GB of memory or less.

You can use the quantization recipes in conjunction with any of the `chat`, `generate` and `browser` commands to test their impact and accelerate model execution. You will apply these recipes to the export commands below, to optimize the exported models. To adapt these recipes or write your own, please refer to the [quantization overview](docs/quantization.md).
You can use the quantization recipes in conjunction with any of the `chat`, `generate` and `browser` commands to test their impact and accelerate model execution. You will apply these recipes to the `export` commands below, to optimize the exported models. For example:
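
As one sketch of such an example, assuming `--quantize` also accepts a path to one of these recipe files (this exact invocation is an assumption for illustration, not text from the PR), an export might look like:

python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3_mobile.pte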
Contributor

did we add eval as well

Contributor

I'll defer to you for adding @jerryzh168

| linear (asymmetric) | `'{"linear:int<bitwidth>" : {"groupsize" : <groupsize>}}'` |
| linear with dynamic activations (symmetric) | `'{"linear:a8w4dq" : {"groupsize" : <groupsize>}}'`|
| linear with GPTQ (asymmetric) | `'{"linear:int4-gptq" : {"groupsize" : <groupsize>}}'`|
| linear with HQQ (asymmetric) |`'{"linear:hqq" : {"groupsize" : <groupsize>}}'`|
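
As a concrete illustration of the first row, the asymmetric int-linear scheme could be exercised in eager mode (the bitwidth and groupsize values are illustrative, and this assumes `generate` accepts `--quantize` directly; the gptq/int4/hqq rows are the paths this PR notes do not export through ExecuTorch):

python3 torchchat.py generate llama3 --quantize '{"linear:int8": {"groupsize": 256}}' --prompt "Hello my name is"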
Contributor

we probably want to tag hqq with bitwidth as well

Contributor Author

hmn, where is hqq bitwidth added?

@jerryzh168 (Contributor) left a comment

Looks good overall I think; the overall structure is better: we just explain the multiple dimensions of config and give a few examples showing how they can be combined.

@mikekgfb (Contributor) left a comment

Please make sure that whatever commands you changed are tested.

Thank you!

@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch from 781be13 to 50651f5 Compare April 25, 2024 21:15
@lucylq lucylq requested a review from kimishpatel April 26, 2024 04:09
@kimishpatel (Contributor) left a comment

Thanks for the changes

@lucylq lucylq merged commit ed70263 into main Apr 26, 2024
@malfet malfet deleted the lfq.quantize-on-et-export branch April 30, 2024 16:51
malfet pushed a commit that referenced this pull request Jul 17, 2024

* create dir on download

* quantization
Labels: CLA Signed

6 participants