
Commit f32cffd

pytorchbot and SS-JIA authored
[ET-VK][Llama] Apply XNNPACK partitioner as well when lowering to Vulkan (#6857)
* [ET-VK] Enforce GPU buffer limit when partitioning

Pull Request resolved: #6829

## Context

In Vulkan, there is a limit on the number of elements a GPU buffer can hold. If a GPU buffer exceeds this limit, the API will either produce an error or undefined behaviour will ensue.

## Changes

Along with `texture_limits`, introduce a configurable `buffer_limit` entry in the partitioner configuration.

ghstack-source-id: 253568943
Differential Revision: [D65899828](https://our.internmc.facebook.com/intern/diff/D65899828/)

* [ET-VK][Llama] Apply XNNPACK partitioner as well when lowering to Vulkan

Pull Request resolved: #6830

## Context

The final logit linear layer in the Transformer architecture has extremely large tensors: both its output and its weight have a dimension equal to the vocabulary size, which can be very large. Because of this, image textures cannot be used to execute the op when running with the Vulkan delegate, so an implementation using buffer-based tensors must be used. Unfortunately, Vulkan does not currently have a performant buffer-based implementation of linear. As a result, if this final linear layer is executed in Vulkan, model inference is extremely slow.

## Changes

The buffer-limit change above prevents the final logit linear layer from being delegated to Vulkan. This diff modifies the export llama script to apply the XNNPACK partitioner after the Vulkan partitioner when lowering to Vulkan, so that the remaining ops are accelerated with XNNPACK. 4-bit quantization also applies an additional quantizer after the Vulkan quantizer (which skips the final logit linear layer) so that the final logit linear can be quantized as well.

## Long Term

This is a temporary measure while an optimized buffer-based linear implementation is developed. Once the Vulkan implementation achieves parity with XNNPACK, the final logit linear will be delegated to Vulkan once more.

ghstack-source-id: 253568942
Differential Revision: [D65899827](https://our.internmc.facebook.com/intern/diff/D65899827/)

---------

Co-authored-by: Stephen Jia <[email protected]>
1 parent 0f2995f commit f32cffd
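
Note: as a hedged illustration of the partitioning strategy described in the commit message, the sketch below stacks the two partitioners in the order the commit relies on. The `texture_limits` and `buffer_limit` keys are named in the commit text, but the concrete limit values, the shape of the compile-options dict, and the default `XnnpackPartitioner()` construction are assumptions for demonstration only.

```python
# Illustrative sketch only; the config values below are assumed, not taken from this commit.
from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

partitioners = []

# Vulkan goes first. Ops whose tensors exceed `buffer_limit` (for example the
# vocab-sized final logit linear) are left undelegated rather than being run
# through slow buffer-based Vulkan kernels.
partitioners.append(
    VulkanPartitioner(
        {
            "texture_limits": (16384, 16384, 2048),  # assumed example limits
            "buffer_limit": 1 << 26,  # assumed example element-count limit
        }
    )
)

# XNNPACK is applied afterwards, so anything Vulkan declined is still accelerated.
partitioners.append(XnnpackPartitioner())
```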

2 files changed: +15 −1 lines changed


examples/models/llama/export_llama_lib.py

Lines changed: 4 additions & 0 deletions
@@ -682,6 +682,10 @@ def _export_llama(args) -> LLMEdgeManager: # noqa: C901
                 args.enable_dynamic_shape,
             )
         )
+        # Apply XNNPACK after Vulkan so that undelegated ops can be accelerated by XNNPACK
+        partitioners.append(
+            get_xnnpack_partitioner(dynamic_quant_only_partitioner=False)
+        )
         modelname = f"vulkan_{modelname}"
 
     if args.mps:
examples/models/llama/source_transformation/quantize.py

Lines changed: 11 additions & 1 deletion
@@ -157,7 +157,17 @@ def quantize(  # noqa C901
         model = gptq_quantizer.quantize(model, inputs)
         return model
     elif qmode == "vulkan_4w":
-        model = VkInt4WeightOnlyQuantizer().quantize(model)
+        q_group_size = 256 if group_size is None else group_size
+        model = VkInt4WeightOnlyQuantizer(groupsize=q_group_size).quantize(model)
+
+        # Apply additional quantizer for linear layers that aren't lowered to Vulkan
+        # at the moment
+        from torchao.quantization.quant_api import Int8DynActInt4WeightQuantizer
+
+        model = Int8DynActInt4WeightQuantizer(
+            precision=torch_dtype, groupsize=q_group_size
+        ).quantize(model)
+
         return model
     else:
         raise Exception(f"Unrecognized quantize mode: {qmode}")
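
As a hedged, self-contained illustration of the two-pass `vulkan_4w` scheme in the hunk above: the torchao quantizer call mirrors the diff (with an explicit dtype standing in for `torch_dtype`), while the `VkInt4WeightOnlyQuantizer` import path and the toy model are assumptions for demonstration only.

```python
import torch
from torchao.quantization.quant_api import Int8DynActInt4WeightQuantizer

# Assumed import path for the Vulkan 4-bit weight-only quantizer.
from executorch.backends.vulkan._passes import VkInt4WeightOnlyQuantizer

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 32_000),  # stand-in for the vocab-sized logit linear
)

q_group_size = 256  # the diff falls back to 256 when no group size is given

# Pass 1: 4-bit weight-only quantization for the linears that will be lowered
# to Vulkan; per the commit message, this quantizer skips the final logit linear.
model = VkInt4WeightOnlyQuantizer(groupsize=q_group_size).quantize(model)

# Pass 2: quantize the linears the first pass left as plain nn.Linear (e.g.
# the final logit linear), so they stay fast when they fall back to XNNPACK.
model = Int8DynActInt4WeightQuantizer(
    precision=torch.float32, groupsize=q_group_size
).quantize(model)
```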
