
Commit 3d24313

fadara01 authored and pytorchmergebot committed
Pass ideep::lowp_kind to matmul_forward::compute on cache misses (pytorch#135058)

Optimized dynamic quantization for aarch64 was enabled by pytorch#126687 and pytorch#134897. This PR fixes an issue for aarch64 where, on a [cache miss](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp#L592) (e.g. if input dimensions change), [`ideep::matmul_forward::compute`](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L160) wrongly runs with the [default lowp_kind (u8s8)](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L174), which is not supported by oneDNN+ACL (Arm Compute Library), causing the workload to fall back to a much slower oneDNN `gemm:jit` kernel.

Example:

```python
import torch

DIM = 4096
INPUT_SIZE1 = 32
INPUT_SIZE2 = 16

class LinearNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(DIM, DIM, bias=False)

    def forward(self, x):
        x = self.fc1(x)
        return x

input1 = torch.randn(size=(INPUT_SIZE1, DIM))
input2 = torch.randn(size=(INPUT_SIZE2, DIM))

with torch.no_grad():
    model = LinearNet()
    model = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear})
    model(input1)  # this goes to ACL lowp_gemm
    print("=" * 50)
    model(input2)  # this goes to gemm:jit without this PR, and to ACL with this PR
```

In the code snippet above:

- The matmul from `model(input1)` goes to oneDNN+ACL (in both cases, with and without the PR).
- The matmul from `model(input2)`: **without this PR**, there is a cache miss (different input shapes) and `matmul_forward::compute` runs with the default lowp_kind (u8s8), so the matmul falls back to `gemm:jit` in oneDNN. **With this PR**, the matmul goes to oneDNN+ACL, which is around 10x faster than oneDNN+jit.

Pull Request resolved: pytorch#135058
Approved by: https://github.com/jondea, https://github.com/malfet
1 parent cd472bb commit 3d24313

File tree

1 file changed (+15 −4 lines)


aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp

Lines changed: 15 additions & 4 deletions
```diff
@@ -590,10 +590,21 @@ at::Tensor PackedLinearWeightsOnednn::apply_dynamic_impl(
     LinearParams& params = get_cache().get_param();
     ideep::matmul_forward::compute(params, x, w, b, y, src_scales, src_zero_point);
   } else {
-    ideep::matmul_forward::compute(x, w, b, y,
-        src_scales, weights_scales, ideep::scale_t(),
-        src_zero_point, ideep::zero_point_t(),
-        1.0f, 1.0f, op_attr);
+    ideep::matmul_forward::compute(
+        x,
+        w,
+        b,
+        y,
+        src_scales,
+        weights_scales,
+        ideep::scale_t(),
+        src_zero_point,
+        ideep::zero_point_t(),
+        1.0f,
+        1.0f,
+        op_attr,
+        ideep::tensor::data_type::undef,
+        std::is_signed_v<input_qtype> ? ideep::s8s8 : ideep::u8s8);
   }
   auto out_sizes = input.sizes().vec();
   out_sizes.back() = w.get_dim(1);
```
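The core of the change is the final argument: on a cache miss, the lowp_kind is now chosen from the signedness of the input's quantized type instead of silently defaulting to u8s8. A minimal pure-Python sketch of that selection logic (the function name here is illustrative, not part of the ideep API):

```python
def select_lowp_kind(input_is_signed: bool) -> str:
    """Mirror of the C++ ternary added by this patch:

        std::is_signed_v<input_qtype> ? ideep::s8s8 : ideep::u8s8

    The returned name encodes the (activation, weight) integer types:
    s8s8 means signed 8-bit activations and weights; u8s8 means unsigned
    8-bit activations with signed 8-bit weights. Passing the kind
    explicitly keeps oneDNN+ACL on its fast lowp_gemm path, whereas the
    implicit u8s8 default is unsupported by ACL and falls back to
    gemm:jit.
    """
    return "s8s8" if input_is_signed else "u8s8"

# Signed activations (e.g. qint8) select s8s8; unsigned (quint8) select u8s8.
print(select_lowp_kind(True))   # s8s8
print(select_lowp_kind(False))  # u8s8
```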
