Summary:
## Context
Exporting llama models to Vulkan with 4-bit weight quantization is currently broken because the behavior of the `groupwise_affine_quantize_tensor` utility function from `torchao` recently changed: it no longer packs two 4-bit integers into a single 8-bit value.
To fix this, have `VkInt4WeightOnlyQuantizer` perform the packing step itself.
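For illustration, here is a minimal sketch of the packing step the quantizer now performs itself. The function name and the nibble order (which element lands in the high vs. low bits) are assumptions for this sketch; the real convention must match what the Vulkan compute shaders expect.

```python
import torch

def pack_int4(qweight: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit values (one per byte) into single uint8 bytes.

    `qweight` is assumed to be a uint8 tensor with values in [0, 15] and an
    even size along the last dimension.
    """
    assert qweight.shape[-1] % 2 == 0, "last dim must be even to pack pairs"
    low = qweight[..., ::2]    # even-indexed nibbles -> low 4 bits (assumed order)
    high = qweight[..., 1::2]  # odd-indexed nibbles  -> high 4 bits (assumed order)
    return ((high << 4) | low).to(torch.uint8)

# Example: an (4, 8) tensor of 4-bit values packs down to (4, 4) bytes.
q = torch.randint(0, 16, (4, 8), dtype=torch.uint8)
packed = pack_int4(q)
assert packed.shape == (4, 4)
```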
Reviewed By: jorgep31415
Differential Revision: D67051119