Update on "[ET-VK][Ops] aten.convolution (SlidingWindow)"

jorgep31415 · jorgep31415 · commit 1cd8c02bb9be · 2024-04-02T17:24:53.000-07:00
## The Operator

`nn.Module` invocations of [`nn.Conv2d`](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html#torch.nn.Conv2d) and [`nn.ConvTranspose2d`](https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html#torch.nn.ConvTranspose2d) get compiled to `aten.convolution.default` in the Edge Dialect, which carries the signature

```
- func: convolution(Tensor input, Tensor weight, Tensor? bias, int[] stride, SymInt[] padding, int[] dilation, bool transposed, SymInt[] output_padding, int groups) -> Tensor
```

## Summary (cases handled)

We introduce support for the convolution cases covered by [ATen-VK's default SlidingWindow implementation](https://github.com/pytorch/pytorch/blob/09c72eaa3f69f90402c86a30abf4fc621298578c/aten/src/ATen/native/vulkan/ops/Convolution.cpp#L73). This is achieved by
- reusing the [existing `conv2d.glsl`](https://github.com/pytorch/pytorch/blob/09c72eaa3f69f90402c86a30abf4fc621298578c/aten/src/ATen/native/vulkan/glsl/conv2d.glsl), and
- [moving special weights prepacking from CPU](https://github.com/pytorch/pytorch/blob/09c72eaa3f69f90402c86a30abf4fc621298578c/aten/src/ATen/native/vulkan/ops/Convolution.cpp#L134-L235) to the GPU in `conv2d_prepack_weights.glsl`.

We also include resizing support for dynamic shapes. Note that only height and width of the input can vary.

## Cases not handled

The implementation is on par with ATen-VK's SlidingWindow. This means the following cases are missing:
1. **Groups G > 1.** Largely not covered by ATen-VK. `G = in_channels` is covered by ATen-VK's Depthwise impl and will be added soon.
2. **Batch (input) N > 1.** Not covered by ATen-VK.
3. **Padding > 0 while Dilation, Kernel > 1.** Not covered by ATen-VK.

## Coming soon

For our CUNET model, the first two are required and the third is useful.
1. Transpose convolution
2. Depthwise convolution (for completeness)
3. Pointwise convolution (for optimization)
4. Null bias

Differential Revision: [D55346778](https://our.internmc.facebook.com/intern/diff/D55346778/)

[ghstack-poisoned]
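To make the dynamic-shape resizing mentioned in the summary concrete, here is a minimal sketch of how an output height or width could be recomputed when the input's height or width changes. The helper name `conv_out_dim` is hypothetical and is not taken from this PR; the formula is the non-transposed output-shape formula given in the `nn.Conv2d` documentation, and the sketch assumes the arguments describe a valid convolution (the numerator is non-negative, so integer division acts as floor).

```cpp
#include <cstdint>

// Hypothetical helper (not part of this PR): output extent along one spatial
// dimension of a non-transposed convolution, following the nn.Conv2d docs:
//   out = floor((in + 2*padding - dilation*(kernel - 1) - 1) / stride) + 1
// Assumes the numerator is non-negative, so C++ integer division equals floor.
int64_t conv_out_dim(
    int64_t in,
    int64_t kernel,
    int64_t stride,
    int64_t padding,
    int64_t dilation) {
  return (in + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1;
}
```

For instance, a 3x3 kernel with stride 1 and padding 1 keeps a 32-wide input at 32, while stride 2 halves it to 16. Since only the input's height and width can vary, a resize would only need to re-evaluate this for those two dims.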
diff --git a/backends/vulkan/runtime/graph/ops/glsl/conv2d.glsl b/backends/vulkan/runtime/graph/ops/glsl/conv2d.glsl
@@ -78,12 +78,12 @@ void main() {
   kstart.y += pos.z * params.kernel_size.y;
 
   // Perform the convolution by iterating over the overlay region.
-  vec4 sum = texelFetch(bias_in, ivec2(pos.z, 0), 0);
+  ${VEC4_T[DTYPE]} sum = texelFetch(bias_in, ivec2(pos.z, 0), 0);
   const int ic4 = extra_params.in_group_size / 4;
   for (int z4 = 0; z4 < ic4; ++z4, kstart.x += params.kernel_size.x * 4) {
     for (int y = start.y, ky = kstart.y; y < end.y; y += params.dilation.y, ++ky) {
       for (int x = start.x, kx = kstart.x; x < end.x; x += params.dilation.x, kx += 4) {
-        const vec4 in_texel = texelFetch(image_in, ivec3(x, y, z4), 0);
+        const ${VEC4_T[DTYPE]} in_texel = texelFetch(image_in, ivec3(x, y, z4), 0);
 
         // To explain the calculation below, the contents of in_texel and the
         // group of 4 texels loaded from kernel_in are shown:
@@ -115,18 +115,18 @@ void main() {
         //  | x | | A0 |   | y | | A1 |   | z | | A2 |   | w | | A3 |
         //  +---+ +----+   +---+ +----+   +---+ +----+   +---+ +----+
         //
-        //  which is what is expressed in the following calculations.
+        // which is expressed in the following statements.
 
-        const vec4 ktex_0 = texelFetch(kernel_in, ivec2(kx + 0, ky), 0);
+        const ${VEC4_T[DTYPE]} ktex_0 = texelFetch(kernel_in, ivec2(kx + 0, ky), 0);
         sum = fma(in_texel.xxxx, ktex_0, sum);
 
-        const vec4 ktex_1 = texelFetch(kernel_in, ivec2(kx + 1, ky), 0);
+        const ${VEC4_T[DTYPE]} ktex_1 = texelFetch(kernel_in, ivec2(kx + 1, ky), 0);
         sum = fma(in_texel.yyyy, ktex_1, sum);
 
-        const vec4 ktex_2 = texelFetch(kernel_in, ivec2(kx + 2, ky), 0);
+        const ${VEC4_T[DTYPE]} ktex_2 = texelFetch(kernel_in, ivec2(kx + 2, ky), 0);
         sum = fma(in_texel.zzzz, ktex_2, sum);
 
-        const vec4 ktex_3 = texelFetch(kernel_in, ivec2(kx + 3, ky), 0);
+        const ${VEC4_T[DTYPE]} ktex_3 = texelFetch(kernel_in, ivec2(kx + 3, ky), 0);
         sum = fma(in_texel.wwww, ktex_3, sum);
       }
     }
diff --git a/backends/vulkan/runtime/graph/ops/glsl/conv2d.yaml b/backends/vulkan/runtime/graph/ops/glsl/conv2d.yaml
@@ -16,15 +16,3 @@ conv2d:
         SUFFIX: float
   shader_variants:
     - NAME: conv2d
-
-conv2d_prepack_weights:
-  parameter_names_with_default_values:
-    DTYPE: float
-  generate_variant_forall:
-    DTYPE:
-      - VALUE: half
-        SUFFIX: half
-      - VALUE: float
-        SUFFIX: float
-  shader_variants:
-    - NAME: conv2d_prepack_weights
diff --git a/backends/vulkan/runtime/graph/ops/glsl/conv2d_prepack_weights.glsl b/backends/vulkan/runtime/graph/ops/glsl/conv2d_prepack_weights.glsl
@@ -47,7 +47,7 @@ layout(local_size_x_id = 0, local_size_y_id = 1, local_size_z_id = 2) in;
  * rest of this comment. Refer to the code-level comments, for how we translate
  * it to GPU by reversing the steps.
  *
- * Consider example weight tensor of size {10,7,3,3}. The following
+ * Consider an example weight tensor of size {10,7,3,3}. The following
  * transformations will be applied.
  *
  * 1. Pad the N and C dims so that both are a multiple of 4. In this case, 2
diff --git a/backends/vulkan/runtime/graph/ops/glsl/conv2d_prepack_weights.yaml b/backends/vulkan/runtime/graph/ops/glsl/conv2d_prepack_weights.yaml
@@ -0,0 +1,17 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+conv2d_prepack_weights:
+  parameter_names_with_default_values:
+    DTYPE: float
+  generate_variant_forall:
+    DTYPE:
+      - VALUE: half
+        SUFFIX: half
+      - VALUE: float
+        SUFFIX: float
+  shader_variants:
+    - NAME: conv2d_prepack_weights
diff --git a/backends/vulkan/runtime/graph/ops/glsl/max_pool2d.yaml b/backends/vulkan/runtime/graph/ops/glsl/max_pool2d.yaml
diff --git a/backends/vulkan/test/utils/test_utils.cpp b/backends/vulkan/test/utils/test_utils.cpp
@@ -138,7 +138,6 @@ void record_conv2d_prepack_weights_op(
           api::MemoryAccessType::WRITE),
       src_buffer,
       v_dst.gpu_sizes_ubo()->buffer(),
-      v_dst.cpu_sizes_ubo()->buffer(),
       original_sizes_ubo.buffer(),
       padded_sizes_ubo.buffer());
 }
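
The prepack shader comment touched in the diff above describes padding the N and C dims of the weight tensor to multiples of 4. As a rough illustration of that first step only, the sketch below computes padded sizes on the CPU. The helper names `align_up_4` and `padded_weight_sizes` are hypothetical and do not appear in this PR; the remaining transformations from the shader comment are not modeled here.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: round n up to the next multiple of 4.
int64_t align_up_4(int64_t n) {
  return (n + 3) / 4 * 4;
}

// Given weight sizes in {N, C, H, W} order, pad N and C up to multiples of 4
// and leave H and W unchanged.
std::array<int64_t, 4> padded_weight_sizes(
    const std::array<int64_t, 4>& sizes) {
  return {align_up_4(sizes[0]), align_up_4(sizes[1]), sizes[2], sizes[3]};
}
```

For the example weight tensor of size {10,7,3,3} from the shader comment, this yields {12,8,3,3}.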
