Quantize llama3 on et export #436
Conversation
In the summary, can you put the repro steps or describe how you tested it?
Could you also update examples/models/llama2/README to cover llama3 models?
Thanks Martin, added in pytorch/executorch#3315.
Did you try running the generated .pte?
So, the GPTQ, linear int4, and HQQ paths don't export well through ExecuTorch. I'll try running the 8da4w+qe file; updated in the summary.
@mikekgfb I asked Lucy to make these changes. Quite a bit of the latter half of the doc is removed; I think not having a very long doc is better, since those sections were just showcasing example commands for invoking each quant scheme. If you disagree, let me know. Also, is HQQ done or WIP?
Depending on the model and the target device, different quantization recipes may be applied. Torchchat contains two example configurations: one optimized for GPU-based systems (`config/data/cuda.json`) and one for mobile systems (`config/data/mobile.json`). The GPU configuration targets memory bandwidth, which is a scarce resource even on powerful GPUs (and, to a lesser degree, memory footprint, to fit large models into a device's memory). The mobile configuration targets memory footprint, because on many devices a single application is limited to as little as a few GB of memory or less.

You can use the quantization recipes in conjunction with any of the `chat`, `generate` and `browser` commands to test their impact and accelerate model execution. You will apply these recipes to the `export` commands below to optimize the exported models. To adapt these recipes or write your own, please refer to the [quantization overview](docs/quantization.md). For example:
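A minimal sketch of what this looks like, assuming torchchat's `torchchat.py` entry point and its `--quantize`/`--output-pte-path` flags (the model alias and prompt are placeholders):

```bash
# Generate with the mobile recipe applied, to check quality before export
python3 torchchat.py generate llama3 \
    --quantize config/data/mobile.json \
    --prompt "Hello, my name is"

# Export to an ExecuTorch .pte with the same recipe
python3 torchchat.py export llama3 \
    --quantize config/data/mobile.json \
    --output-pte-path llama3.pte
```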
Did we add `eval` as well?
I'll defer to you for adding it, @jerryzh168.
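If `eval` gets the same treatment as `chat` and `generate`, a hedged sketch of what that could look like (the `eval` subcommand, `--tasks` flag, and groupsize value are assumptions here):

```bash
# Evaluate the quantized model on wikitext (subcommand and flags assumed)
python3 torchchat.py eval llama3 \
    --quantize '{"linear:a8w4dq": {"groupsize": 256}}' \
    --tasks wikitext
```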
| quantization scheme | `--quantize` spec |
| --- | --- |
| linear (asymmetric) | `'{"linear:int<bitwidth>" : {"groupsize" : <groupsize>}}'` |
| linear with dynamic activations (symmetric) | `'{"linear:a8w4dq" : {"groupsize" : <groupsize>}}'` |
| linear with GPTQ (asymmetric) | `'{"linear:int4-gptq" : {"groupsize" : <groupsize>}}'` |
| linear with HQQ (asymmetric) | `'{"linear:hqq" : {"groupsize" : <groupsize>}}'` |
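These schemes can be combined in a single spec. A hedged sketch of the "8da4w + quantized embedding" combination mentioned earlier in the thread (the `embedding` key and the concrete groupsize values are assumptions, not taken from this table):

```bash
# 8-bit dynamic activations + 4-bit grouped weights, plus a 4-bit embedding table
python3 torchchat.py export llama3 \
    --quantize '{"linear:a8w4dq": {"groupsize": 256}, "embedding": {"bitwidth": 4, "groupsize": 32}}' \
    --output-pte-path llama3_8da4w.pte
```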
We probably want to tag HQQ with a bitwidth as well.
Hmm, where is the HQQ bitwidth added?
Looks good overall. I think the new structure is better: we explain the multiple dimensions of the config and give a few examples showing how they can be combined.
Please make sure that whatever command changes you made are tested. Thank you!
Thanks for the changes
* create dir on download
* quantization
Test with: run the export using the different quantization schemes in https://github.com/pytorch/torchchat/blob/main/docs/quantization.md. Not tested for AOTI.
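A hedged sketch of those repro steps, assuming torchchat's CLI and its `--pte-path` flag for running an exported model (the scheme, prompt, and paths are placeholders):

```bash
# Export llama3 with one of the quantization schemes from docs/quantization.md
python3 torchchat.py export llama3 \
    --quantize '{"linear:a8w4dq": {"groupsize": 256}}' \
    --output-pte-path llama3.pte

# Run the generated .pte to verify it produces sensible output
python3 torchchat.py generate llama3 \
    --pte-path llama3.pte \
    --prompt "Once upon a time"
```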