Enable GPTQ in executorch #2425
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/2425
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit 7503eeb with merge base 39c93aa. The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
examples/models/llama2/quantize.py
Outdated
return torch.empty_like(input, dtype=dtype)


def group_quantize_tensor_symmetric(
deleted here, but still used by prepare_int4_weight_and_scales_and_zeros
this is probably not used; we'll clean up a bit later
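For context on what this deleted helper did: below is a minimal sketch of symmetric group quantization, assuming contiguous groups along the last dimension; the name suffix, defaults, and details are illustrative, not the exact torchao implementation.

import torch

def group_quantize_tensor_symmetric_sketch(w, n_bit=4, groupsize=128):
    # Split the weight into contiguous groups of `groupsize` values.
    assert w.numel() % groupsize == 0
    w_grouped = w.reshape(-1, groupsize)
    # Symmetric scheme: one scale per group, zero point fixed at 0.
    q_max = 2 ** (n_bit - 1) - 1  # 7 for 4-bit
    scales = w_grouped.abs().amax(dim=1, keepdim=True).clamp(min=1e-6) / q_max
    w_int = torch.clamp(torch.round(w_grouped / scales), -q_max - 1, q_max)
    return w_int.reshape(w.shape).to(torch.int8), scales.flatten()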
examples/models/llama2/quantize.py
Outdated
return torch.stack([first_elements, second_elements], dim=-1).view(up_size(shape))


def per_token_dynamic_quant(input: torch.Tensor) -> torch.Tensor:
deleted here, but still used by linear_forward_8da4w
we'll import from torchao
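Roughly, per-token dynamic quantization does the following quantize/dequantize round trip (a sketch for context, not the actual torchao import):

import torch

def per_token_dynamic_quant_sketch(x: torch.Tensor) -> torch.Tensor:
    # One int8 scale per token, computed on the fly ("dynamic"), followed
    # by an immediate dequantize so downstream ops still see float values.
    scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 127.0
    x_int8 = torch.clamp(torch.round(x / scales), -128, 127)
    return x_int8 * scales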
examples/models/llama2/quantize.py
Outdated
return torch.empty_like(input, dtype=output_dtype)


def get_group_qparams_symmetric(w, n_bit=4, groupsize=128, precision=torch.float32):
deleted here, but still used by Int8DynActInt4WeightGPTQQuantHandler
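As with the sketch above, and purely for context: this helper computes the per-group quantization parameters without quantizing the weight itself; in the symmetric case the zero points are all zero. A hedged sketch, assuming groups along the last dimension:

import torch

def get_group_qparams_symmetric_sketch(w, n_bit=4, groupsize=128, precision=torch.float32):
    # Same per-group scale computation as in the quantize sketch above,
    # but only the qparams are returned, cast to the requested precision.
    assert w.shape[-1] % groupsize == 0
    w_grouped = w.reshape(w.shape[0], -1, groupsize)
    q_max = 2 ** (n_bit - 1) - 1
    scales = (w_grouped.abs().amax(dim=-1).clamp(min=1e-6) / q_max).to(precision)
    zeros = torch.zeros_like(scales)  # symmetric => zero point is 0
    return scales, zeros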
examples/models/llama2/quantize.py
Outdated
)


def pack_scales_and_zeros(scales, zeros, precision=torch.float16):
deleted here, but still used by Int8DynActInt4WeightGPTQQuantHandler
this is fine; we are not using this QuantHandler, I'll remove it later as well
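For reference, a sketch of what the packing step does; the exact layout the int4 kernels expect is an assumption here, not taken from the PR.

import torch

def pack_scales_and_zeros_sketch(scales, zeros, precision=torch.float16):
    # Interleave each group's scale and zero point into a trailing dim of 2,
    # producing a single combined qparams tensor consumed alongside the weight.
    assert scales.shape == zeros.shape
    return torch.stack([scales, zeros], dim=-1).to(precision).contiguous()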
@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed from 9b6c568 to 9e18ed0
Summary: Previously we just added the code but didn't test it; this PR also tests GPTQ locally to make sure we can produce a model using GPTQ from the torchao repo.
Pull Request resolved: pytorch#2425
Test Plan: python3 -m examples.models.llama2.export_llama -c stories110M.pt -p params.json -qmode 8da4w-gptq -X -d fp32
Reviewed By: manuelcandales
Differential Revision: D54922375
Pulled By: jerryzh168
Summary: Previously we just added the code but didn't test it; this PR also tests GPTQ locally to make sure we can produce a model using GPTQ from the torchao repo. Currently blocked on XNNPACK lowering.
Test Plan: python3 -m examples.models.llama2.export_llama -c stories110M.pt -p params.json -qmode 8da4w-gptq -X
# assert (
#     start_pos is None and cache_k is None and cache_v is None
# ), "Caches and start_pos are unused when use_kv_cache is False"
@kimishpatel is this OK? I need to comment this out to make sure we can run torch._dynamo.export in GPTQ.
@HDCharles is going to work on a refactor of GPTQ to remove export and use tensor subclasses instead; we can revert this change when that is implemented, I think.
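To illustrate the issue with a hypothetical module (not the actual llama2 code): torch._dynamo.export traces the forward with the optional arguments in place, so a Python-level assert like this can get in the way of tracing, which is why the PR disables it.

import torch

class AttentionSketch(torch.nn.Module):
    def __init__(self, use_kv_cache: bool = False):
        super().__init__()
        self.use_kv_cache = use_kv_cache

    def forward(self, x, start_pos=None, cache_k=None, cache_v=None):
        if not self.use_kv_cache:
            # The check the PR comments out lived roughly here:
            # assert (
            #     start_pos is None and cache_k is None and cache_v is None
            # ), "Caches and start_pos are unused when use_kv_cache is False"
            pass
        return x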
@jerryzh168 merged this pull request in 246ed45.
Summary:
Previously we just added the code but didn't test it; this PR also tests GPTQ locally to make sure we can produce a model using GPTQ from the torchao repo.
Test Plan:
python3 -m examples.models.llama2.export_llama -c stories110M.pt -p params.json -qmode 8da4w-gptq -X -d fp32