Quantize llama3 on et export #436


Merged
merged 2 commits into main on Apr 26, 2024

Conversation

@lucylq (Contributor) commented Apr 23, 2024

  • Recommend quantizing llama models. Update examples to use llama3 rather than stories.
  • Update quantization docs: weight, embedding, and examples.
  • Some quantization schemes (gptq, linear int4, hqq) do not work for et export and are not intended to; specify which schemes are supported.

Test with:

python3 torchchat.py export llama3 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}' --output-pte-path llama3_q.pte       

Run:

python3 torchchat.py generate llama3 --device cpu --pte-path llama3_8da_qe4_32.pte --prompt "Hello my name is" --tokenizer-path ../llama-models/llama3/tiktokenizer.bin

<|begin_of_text|>Hello my name is Samantha, I am a trained healthcare professional with a passion for delivering exceptional patient care. I am excited to offer my services as a caregiver through this platform. I have extensive experience in providing assistance with daily living activities, such as bathing, dressing, grooming, meal preparation, medication management, and other essential tasks. I am also trained in first aid and CPR. I am available for short-term or long-term positions and can work flexible hours depending on your needs. I am a non-smoker and a non-drinker, and I have a clean driving record. I look forward to working with you and your loved ones.
Hello my name is Samantha, I am a trained healthcare professional with a passion for delivering exceptional patient care. I am excited to offer my services as a caregiver through this platform. I have extensive experience in providing assistance with daily living activities, such as bathing, dressing, grooming, meal preparation, medication management, and other essential tasks. I am also trained in first aid and CPR.
Time for inference 1: 19.38 sec total, 10.32 tokens/sec
Bandwidth achieved: 0.00 GB/s
==========
Average tokens/sec: 10.32
Memory used: 0.00 GB

Tested with the different export quantization schemes in https://github.com/pytorch/torchchat/blob/main/docs/quantization.md.
Not tested for AOTI.
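
For comparison, an embedding-only scheme can be exported the same way. This is a sketch, not one of the commands tested in this PR, and the output filename is illustrative:

python3 torchchat.py export llama3 --quantize '{"embedding": {"bitwidth": 4, "groupsize": 32}}' --output-pte-path llama3_emb4_32.pte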

@facebook-github-bot added the CLA Signed label Apr 23, 2024
@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch 2 times, most recently from 0611004 to ed95bd1 Compare April 24, 2024 16:39
@lucylq lucylq marked this pull request as ready for review April 24, 2024 16:40
@lucylq lucylq requested a review from kimishpatel April 24, 2024 16:40
@kimishpatel (Contributor)

In the summary, can you put the repro steps or how you tested it?

@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch 4 times, most recently from 93db116 to 93f7847 Compare April 24, 2024 17:14
@iseeyuan (Contributor)

Could you also update examples/models/llama2/README, where for llama3 models, --embedding-quantize 4,32 is suggested to reduce the model size significantly due to the larger vocabulary size.

@lucylq (Contributor, Author) commented Apr 24, 2024

> Could you also update examples/models/llama2/README, where for llama3 models, --embedding-quantize 4,32 is suggested to reduce the model size significantly due to the larger vocabulary size.

Thanks Martin, added in pytorch/executorch#3315
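
For context, the suggested flag would be combined with the existing ExecuTorch llama export options, roughly as sketched below. Apart from --embedding-quantize 4,32, the flags and placeholder paths are assumptions based on the ExecuTorch llama2 example, not commands from this PR:

# sketch only: checkpoint/params paths are placeholders; verify flags against the ExecuTorch llama2 README
python -m examples.models.llama2.export_llama -c <checkpoint.pth> -p <params.json> -kv -X -qmode 8da4w --group_size 128 --embedding-quantize 4,32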

@kimishpatel (Contributor)

> python3 torchchat.py export llama3 --quantize '{"linear:gptq": {"groupsize" : 32} }' --output-pte-path llama3_q.pte

Did you try running the generated pte?

@lucylq (Contributor, Author) commented Apr 24, 2024

> python3 torchchat.py export llama3 --quantize '{"linear:gptq": {"groupsize" : 32} }' --output-pte-path llama3_q.pte
>
> Did you try running the generated pte?

So, the gptq, linear int4, and hqq paths don't export well through executorch. I'll try running the 8da4w+qe file; updated in the summary.

@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch 4 times, most recently from 65ce384 to 224b258 Compare April 25, 2024 00:13
@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch 4 times, most recently from cc0e202 to 5bf7895 Compare April 25, 2024 00:43
@kimishpatel (Contributor)

@mikekgfb I asked Lucy to make these changes. Quite a bit of material toward the latter half of the doc is removed. I think not having a very long doc is better, since those sections were just showcasing example commands for invoking each quant scheme. If you differ in opinion, let me know.

Also, is HQQ done or WIP?

@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch 2 times, most recently from f99752f to 3f81ec4 Compare April 25, 2024 16:27
@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch from 3f81ec4 to 2af547e Compare April 25, 2024 18:45
@lucylq lucylq requested a review from jerryzh168 April 25, 2024 18:47
@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch from 2af547e to b65f005 Compare April 25, 2024 18:54
@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch 2 times, most recently from 86a107d to 781be13 Compare April 25, 2024 19:04
Depending on the model and the target device, different quantization recipes may be applied. Torchchat contains two example configurations to optimize performance: `config/data/cuda.json` for GPU-based systems and `config/data/mobile.json` for mobile systems. The GPU configuration is targeted towards optimizing for memory bandwidth, which is a scarce resource even on powerful GPUs (and, to a lesser degree, memory footprint to fit large models into a device's memory). The mobile configuration is targeted towards optimizing for memory footprint, because on many devices a single application is limited to only a few GB of memory or less.

You can use the quantization recipes in conjunction with any of the `chat`, `generate` and `browser` commands to test their impact and accelerate model execution. You will apply these recipes to the export commands below, to optimize the exported models. To adapt these recipes or write your own, please refer to the [quantization overview](docs/quantization.md).
You can use the quantization recipes in conjunction with any of the `chat`, `generate` and `browser` commands to test their impact and accelerate model execution. You will apply these recipes to the `export` commands below, to optimize the exported models. For example:
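
As one sketch of such an example, assuming `--quantize` also accepts a path to one of these recipe files (this exact invocation is an assumption for illustration, not text from the PR), an export might look like:

python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3_mobile.pte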
Contributor

did we add eval as well

Contributor

I'll defer to you for adding @jerryzh168

| linear (asymmetric) | `'{"linear:int<bitwidth>" : {"groupsize" : <groupsize>}}'` |
| linear with dynamic activations (symmetric) | `'{"linear:a8w4dq" : {"groupsize" : <groupsize>}}'`|
| linear with GPTQ (asymmetric) | `'{"linear:int4-gptq" : {"groupsize" : <groupsize>}}'`|
| linear with HQQ (asymmetric) |`'{"linear:hqq" : {"groupsize" : <groupsize>}}'`|
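
As a concrete illustration of the first row, the asymmetric int-linear scheme could be exercised in eager mode (the bitwidth and groupsize values are illustrative, and this assumes `generate` accepts `--quantize` directly; the gptq/int4/hqq rows are the paths this PR notes do not export through ExecuTorch):

python3 torchchat.py generate llama3 --quantize '{"linear:int8": {"groupsize": 256}}' --prompt "Hello my name is"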
Contributor

we probably want to tag hqq with bitwidth as well

Contributor Author

hmn, where is hqq bitwidth added?

@jerryzh168 (Contributor) left a comment

Looks good overall I think; the overall structure is better: we just explain the multiple dimensions of config and give a few examples showing how they can be combined.

@mikekgfb (Contributor) left a comment

Please make sure that whatever commands you changed are tested.

Thank you!

@lucylq lucylq force-pushed the lfq.quantize-on-et-export branch from 781be13 to 50651f5 Compare April 25, 2024 21:15
@lucylq lucylq requested a review from kimishpatel April 26, 2024 04:09
@kimishpatel (Contributor) left a comment

Thanks for the changes

@lucylq lucylq merged commit ed70263 into main Apr 26, 2024
@malfet malfet deleted the lfq.quantize-on-et-export branch April 30, 2024 16:51
malfet pushed a commit that referenced this pull request Jul 17, 2024

* create dir on download

* quantization
Labels: CLA Signed

6 participants