Update quantize.py to use torchao Quantizers #882
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/882
Note: Links to docs will display an error until the doc builds have completed.
❌ 1 Cancelled Job as of commit f85339a with merge base ee681bf. The following job was cancelled, please retry:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Looking good contextually, we'll want a stamp from the AO side though
Can you update the README and the --help output to explain the --quantize flag?
Also, when you pass --quantize during generate, what exactly happens if the model on disk is fp16 and you want 8-bit? Does it quantize then and there? Does it get saved for future use? We need enough information in the help and README for users to know what to expect and how to do what they want.
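For what it's worth, here is a minimal sketch of the generate-time behavior as I understand it. Everything below is illustrative: `load_checkpoint` and `quantize_model` are hypothetical stand-ins for torchchat internals, not its real API, so treat this as an assumption to be confirmed in the docs.

```python
# Illustrative sketch only: NOT torchchat's real code path, just the
# shape of "quantize at generate time, in memory".
import json
import torch.nn as nn

def load_checkpoint(path: str) -> nn.Module:
    # Hypothetical stand-in for torchchat's model loading; pretend this
    # returns the fp16 model read from disk.
    return nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 8)).half()

def quantize_model(model: nn.Module, config: dict) -> nn.Module:
    # Hypothetical stand-in for quantize.py; the real code dispatches
    # config keys (e.g. "linear:a8w4dq") to torchao Quantizers.
    for key, kwargs in config.items():
        print(f"would apply {key} with {kwargs}")
    return model

# What `generate ... --quantize '...'` plausibly does:
cfg = json.loads('{"linear:a8w4dq": {"groupsize": 256}}')
model = load_checkpoint("model.pth")  # fp16 weights read from disk
model = quantize_model(model, cfg)    # quantized in memory, per run
# Generation then uses `model`; the on-disk checkpoint is unchanged,
# so the quantized weights are not saved for future use.
```

If that reading is right, persisting a quantized model is a separate export step rather than a side effect of generate.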
Good question, we will make sure that is added.
I just pushed out a fix for the failing tests; all compile tests now pass for x86. We should be able to get this across the line now. Approved pending re-review from @Jack-Khuu and @byjlw.
Early approval: I thought only the test I fixed was failing. This needs more work to pass the tests.
Summary: Remove duplicate code for Int4WeightOnlyQuantizer and Int8DynActInt4WeightQuantizer and use the torchao API.

Test Plan:
```
python torchchat.py generate llama2 --quantize '{"linear:int4": {"groupsize": 256}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"cpu"}}' --prompt "Once upon a time," --max-new-tokens 256
python torchchat.py generate llama2 --quantize '{"linear:a8w4dq": {"groupsize": 256}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"cpu"}}' --prompt "Once upon a time," --max-new-tokens 256
```

Reviewers:
Subscribers:
Tasks:
Tags:
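For reviewers following along, a hedged sketch of the torchao Quantizer interface this PR switches to. The import locations and constructor arguments have shifted across torchao releases, so the exact paths below are my assumption rather than a pinned reference.

```python
# Hedged sketch of the torchao Quantizer API adopted in quantize.py.
# Import locations vary by torchao release; adjust to the pinned version.
import torch
from torchao.quantization.quant_api import (
    Int4WeightOnlyQuantizer,
    Int8DynActInt4WeightQuantizer,
)

def quantize_int4_weight_only(model: torch.nn.Module, groupsize: int = 256):
    # Corresponds to the "linear:int4" key in the --quantize config.
    return Int4WeightOnlyQuantizer(groupsize=groupsize).quantize(model)

def quantize_a8w4dq(model: torch.nn.Module, groupsize: int = 256):
    # Corresponds to "linear:a8w4dq": 8-bit dynamic activations with
    # 4-bit grouped weights.
    return Int8DynActInt4WeightQuantizer(groupsize=groupsize).quantize(model)
```

Both classes expose a `quantize(model)` method that returns the transformed module, which is what lets torchchat delete its duplicated implementations.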
Looks legit, I can check on the perf when it lands.
@byjlw We will add documentation in the next PR. Any other concerns?
Summary:
Remove duplicate code for Int8DynActInt4WeightQuantizer and use the torchao API.
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
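As an illustration of the summary above, a small sketch of how quantize.py can route --quantize config keys to the torchao Quantizers. `QUANTIZER_REGISTRY` and `apply_quantize_config` are hypothetical names for this sketch, not identifiers from the actual diff.

```python
# Hypothetical registry mapping --quantize JSON keys to torchao
# Quantizer classes; the names here are illustrative, not torchchat's.
from torchao.quantization.quant_api import (
    Int4WeightOnlyQuantizer,
    Int8DynActInt4WeightQuantizer,
)

QUANTIZER_REGISTRY = {
    "linear:int4": Int4WeightOnlyQuantizer,
    "linear:a8w4dq": Int8DynActInt4WeightQuantizer,
}

def apply_quantize_config(model, config: dict):
    for key, kwargs in config.items():
        quantizer_cls = QUANTIZER_REGISTRY.get(key)
        if quantizer_cls is None:
            continue  # e.g. "precision"/"executor" keys handled elsewhere
        model = quantizer_cls(**kwargs).quantize(model)
    return model
```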