Commit 3ff0f77

andrewor14 authored and facebook-github-bot committed
Fix llama quantize_per_token numerics
Summary:
The existing implementation can produce quantized values outside the quant range, since we add the zero points after clamping. This was not a problem for symmetric quantization, where the zero points are 0, but it causes dqlinear numerics to diverge significantly from the lowered implementation for asymmetric quantization.

Reviewed By: digantdesai

Differential Revision: D54320424

fbshipit-source-id: e8d9136354b0dac1993ef7825fc331f68d0d4c05
1 parent: 9283e50 · commit: 3ff0f77

1 file changed (+3, −2 lines)

examples/models/llama2/quantize.py

Lines changed: 3 additions & 2 deletions
@@ -234,8 +234,9 @@ def quantize_per_token(
     """
     _quant_min_max_bounds_check(quant_min, quant_max, dtype)
     _per_token_quant_qparam_dim_check(input, scales, zero_points)
-    input = torch.round(input / scales).clamp(quant_min, quant_max).to(dtype)
-    input = input + zero_points
+    input = (
+        torch.round(input / scales + zero_points).clamp(quant_min, quant_max).to(dtype)
+    )
     return input
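As a rough illustration of the numerics issue described in the summary, here is a minimal sketch, not taken from the repository: the tensor values, shapes, and quant parameters are made up purely to show how clamping before adding the zero points can leave the quant range, whereas the fixed ordering clamps last and stays in range.

import torch

# Hypothetical per-token quantization parameters (asymmetric int8), chosen only
# so the out-of-range case is visible; not values from the repository.
x = torch.tensor([[12.5, -0.3]])      # one token, two values
scales = torch.tensor([[0.1]])        # per-token scale
zero_points = torch.tensor([[10]])    # nonzero zero point -> asymmetric
quant_min, quant_max = -128, 127

# Old ordering: clamp first, then add the zero points. round(12.5 / 0.1) = 125
# lies inside [-128, 127], but 125 + 10 = 135 does not, so the "quantized"
# value escapes the quant range. (The cast to int8 is omitted here so the
# out-of-range value stays visible instead of wrapping.)
old = torch.round(x / scales).clamp(quant_min, quant_max) + zero_points
print(old)  # tensor([[135., 7.]]) -- 135 is outside the int8 quant range

# Fixed ordering (what the commit does): add the zero points before clamping,
# so the result is guaranteed to land inside [quant_min, quant_max].
new = torch.round(x / scales + zero_points).clamp(quant_min, quant_max).to(torch.int8)
print(new)  # tensor([[127, 7]], dtype=torch.int8)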
