Commit 34db73d

kimishpatel authored and facebook-github-bot committed
Fix 4bit groupwise dynamic linear quantization (#2251)
Summary: Pull Request resolved: #2251

This diff fixes the following issues:

- Removes scales packing/unpacking.
- Separates compute precision from scales storage precision, instead of maintaining activation/weight precision.
- Defaults to fp32 everywhere unless specified otherwise, because at the moment the groupwise quant kernels in XNNPACK are for fp32.
- Removes some dead code.
- Removes the k-tile constraints: these came from GPU and are not needed here.
- Replaces torch.ops.aten.linear with nn.functional.linear: this had to be done because otherwise delegation doesn't recognize the pattern. Yet another issue with pattern matching.

ghstack-source-id: 217579450
exported-using-ghexport

Bypassing check because OSS failures are unrelated.
bypass-github-export-checks

Reviewed By: cccclai
Differential Revision: D54427828
fbshipit-source-id: 634c34212e6ec80c41b21ae1dd1ad3211bf04862
1 parent bcba739 commit 34db73d
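
To illustrate the last bullet in the summary, here is a minimal sketch of the kind of substitution described. The module below is hypothetical, not code from this diff; only the two API names, torch.ops.aten.linear and nn.functional.linear, come from the summary:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyQuantLinear(nn.Module):
        # Hypothetical stand-in for the quantized linear wrapper touched
        # by this change; it is not the module defined in the repo.
        def __init__(self, weight: torch.Tensor, bias=None):
            super().__init__()
            self.register_buffer("weight", weight)
            self.bias = bias

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Before (per the summary): torch.ops.aten.linear(x, self.weight, self.bias)
            # After: the functional form, which delegation's pattern
            # matcher can recognize after export.
            return F.linear(x, self.weight, self.bias)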

File tree

2 files changed: +98 −109 lines


examples/models/llama2/export_llama_lib.py

Lines changed: 2 additions & 5 deletions
@@ -191,7 +191,8 @@ def quantize(
         return WeightOnlyInt8QuantHandler(model).quantized_model()
     elif qmode == "int4":
         model_int4 = Int8DynActInt4WeightQuantHandler(
-            model, activation_precision=torch_dtype
+            model,
+            precision=torch_dtype,
         ).quantized_model()
         print("quantized model:", model_int4)
         return model_int4
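
For context, a hypothetical call mirroring the new signature in the hunk above. Only the Int8DynActInt4WeightQuantHandler name, the precision keyword, and quantized_model() are taken from the diff; the model object and dtype value are placeholders:

    import torch

    # Assumes Int8DynActInt4WeightQuantHandler is in scope; its import
    # path is not shown in this diff.
    model_int4 = Int8DynActInt4WeightQuantHandler(
        model,                    # placeholder: the eager nn.Module to quantize
        precision=torch.float32,  # single compute precision; fp32 matches the
                                  # XNNPACK groupwise kernels per the summary
    ).quantized_model()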
@@ -397,10 +398,6 @@ def _export_llama(modelname, args) -> str:  # noqa: C901
         modelname = f"xnnpack_{modelname}"

     # TODO: remove this after xnnpack delegation is ready
-    if args.quantization_mode == "int4":
-        raise Exception(
-            "some quantized ops should be lowered to xnnpack, but xnnpack delegate is not ready yet"
-        )

     builder = (
         load_llama_model(
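
As a closing illustration of what the commit title refers to, here is a toy, self-contained sketch of symmetric 4-bit group-wise weight quantization in which the scales are stored at a separately chosen precision. This is a sketch of the general scheme under stated assumptions, not the repo's kernel or handler code:

    import torch

    def toy_groupwise_int4_quantize(
        w: torch.Tensor,
        group_size: int = 32,
        scales_dtype: torch.dtype = torch.float32,
    ):
        # Quantize each group_size-wide slice of a 2-D weight to int4
        # ([-8, 7]) with one scale per group; scales_dtype controls the
        # storage precision of the scales, independent of compute precision.
        oc, ic = w.shape
        assert ic % group_size == 0, "columns must divide evenly into groups"
        wg = w.reshape(oc, ic // group_size, group_size)
        scales = wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 7.0
        q = torch.clamp(torch.round(wg / scales), -8, 7).to(torch.int8)
        return q.reshape(oc, ic), scales.squeeze(-1).to(scales_dtype)

    w = torch.randn(8, 64)
    q, scales = toy_groupwise_int4_quantize(w)  # q: int8 holding int4 values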
