Update quantize.py to use torchao Quantizers #882


Merged: 22 commits merged into main on Jul 17, 2024

Conversation

larryliu0820
Contributor

@larryliu0820 larryliu0820 commented Jul 3, 2024

Summary:

Remove duplicate code for Int8DynActInt4WeightQuantizer and use torchao API.

Test Plan:

```
python torchchat.py generate llama2 --quantize '{"linear:a8w4dq": {"groupsize": 256}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"cpu"}}' --prompt "Once upon a time," --max-new-tokens 256
```
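The `--quantize` flag takes a JSON config mapping scheme names to their options. A minimal sketch of how such a config string might be parsed and validated (the `parse_quantize_arg` helper and the `known` scheme set are illustrative, not torchchat's actual code):

```python
import json

def parse_quantize_arg(arg: str) -> dict:
    """Parse a --quantize JSON string into a {scheme: options} dict."""
    config = json.loads(arg)
    # Scheme names seen in this PR's test plan; real torchchat supports more.
    known = {"linear:a8w4dq", "linear:int4", "precision", "executor"}
    for scheme in config:
        if scheme not in known:
            raise ValueError(f"unknown quantization scheme: {scheme}")
    return config

cfg = parse_quantize_arg(
    '{"linear:a8w4dq": {"groupsize": 256}, '
    '"precision": {"dtype": "float16"}, '
    '"executor": {"accelerator": "cpu"}}'
)
print(cfg["linear:a8w4dq"]["groupsize"])  # 256
```

Each top-level key selects a scheme (e.g. `linear:a8w4dq` for 8-bit dynamic activations with 4-bit weights) and its value carries the scheme's options, such as `groupsize`.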

Reviewers:

Subscribers:

Tasks:

Tags:


pytorch-bot bot commented Jul 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/882

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job

As of commit f85339a with merge base ee681bf:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@larryliu0820 larryliu0820 requested a review from Jack-Khuu July 3, 2024 23:21
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 3, 2024
@larryliu0820 larryliu0820 requested a review from jerryzh168 July 3, 2024 23:22
Contributor

@Jack-Khuu Jack-Khuu left a comment


Looking good contextually; we'll want a stamp from the AO side, though.

@Jack-Khuu Jack-Khuu requested a review from HDCharles July 3, 2024 23:49
byjlw
byjlw previously requested changes Jul 9, 2024
Contributor

@byjlw byjlw left a comment


Can you update the readme and the --help to explain the --quantize flag?
Also when you pass --quantize during generate what exactly happens if the model on disk is fp16 and you want 8bit? Does it quantize then and there? Does it get saved for future use? We need enough information in the help and readme for users to know what to expect and how to do what they want.

@larryliu0820
Contributor Author

Can you update the readme and the --help to explain the --quantize flag? Also when you pass --quantize during generate what exactly happens if the model on disk is fp16 and you want 8bit? Does it quantize then and there? Does it get saved for future use? We need enough information in the help and readme for users to know what to expect and how to do what they want.

Good question, we will make sure that is added. Basically I believe --quantize only has an effect on the model in memory, not the one on disk, unless we explicitly ask generate.py to save it.

jackzhxng
jackzhxng previously approved these changes Jul 15, 2024
Contributor

@jackzhxng jackzhxng left a comment


I just pushed out a fix for the failing tests, all compile tests now pass for x86. We should be able to get this across the line now. Approved pending re-review from @Jack-Khuu and @byjlw

@jackzhxng jackzhxng dismissed their stale review July 16, 2024 00:01

Early approval, I thought only the test that I fixed was failing. Needs more work to pass tests.

larryliu0820 and others added 17 commits July 17, 2024 11:06
Summary:

Remove duplicate code for Int4WeightOnlyQuantizer and
Int8DynActInt4WeightQuantizer and use torchao API.

Test Plan:

```
python torchchat.py generate llama2 --quantize '{"linear:int4": {"groupsize": 256}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"cpu"}}' --prompt "Once upon a time," --max-new-tokens 256
python torchchat.py generate llama2 --quantize '{"linear:a8w4dq": {"groupsize": 256}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"cpu"}}' --prompt "Once upon a time," --max-new-tokens 256
```

Reviewers:

Subscribers:

Tasks:

Tags:
Contributor

@Jack-Khuu Jack-Khuu left a comment


Looks legit, I'll check on the perf when it lands.

@larryliu0820 larryliu0820 requested a review from byjlw July 17, 2024 22:34
@larryliu0820
Contributor Author

@byjlw We will add documentation in the next PR. Any other concerns?

@larryliu0820 larryliu0820 dismissed byjlw’s stale review July 17, 2024 22:48

Merge this PR asap

@larryliu0820 larryliu0820 merged commit e1914fa into main Jul 17, 2024
50 of 51 checks passed
@jackzhxng jackzhxng deleted the use_ao branch July 17, 2024 23:45
@jackzhxng jackzhxng restored the use_ao branch July 17, 2024 23:57