Change weight to channel-packing in Conv1d (#7057)
Summary:
In a model we evaluate, the Conv1d weight tensor has shape (256, 1, 7): 256 is the out-channel count, 1 the in-channel count, and 7 the kernel size.
This leads to suboptimal memory use: under weight-packing the tensor is mapped to extents of `(7 / 4, 1, 256)`, consuming 1MB per tensor. The reason is that each (x, y) plane occupies 4096 bytes on the test device (for both 'OPTIMAL' and 'LINEAR' tiling), even though we use only 2 texels in each plane.
A temporary workaround is to use channel-packing instead. The new extents are `(7, 1, 64)`, a 75% reduction in depth, and hence far less memory (see the sketch below). The trade-off is that each weight read fetches 4x as many texels, but lab tests show no perf regression for our model.
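To make the savings concrete, here is a minimal back-of-the-envelope sketch of the allocation math (not code from this diff). It assumes fp16 RGBA texels (4 lanes x 2 bytes = 8 bytes each) and the 4096-byte-per-plane granularity observed on the test device; `texture_bytes` is a hypothetical helper for illustration, not an API in the codebase.

```python
import math

PLANE_BYTES = 4096   # per-(x, y)-plane allocation granularity on the test device
TEXEL_BYTES = 4 * 2  # assumed RGBA16F texel: 4 lanes x 2 bytes

def texture_bytes(x: int, y: int, z: int) -> int:
    """Allocated size when every (x, y) plane is rounded up to PLANE_BYTES."""
    plane = math.ceil(x * y * TEXEL_BYTES / PLANE_BYTES) * PLANE_BYTES
    return z * plane

out_c, in_c, kernel = 256, 1, 7  # Conv1d weight (256, 1, 7)

# Weight-packing: kernel packed into texel lanes -> extents (ceil(7/4), 1, 256)
print(texture_bytes(math.ceil(kernel / 4), in_c, out_c))  # 1048576 (~1MB)

# Channel-packing: out-channels packed -> extents (7, 1, ceil(256/4)) = (7, 1, 64)
print(texture_bytes(kernel, in_c, math.ceil(out_c / 4)))  # 262144 (~256KB)
```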
## Future work
A more optimal solution would map the weight tensor `(out-channel, in-channel, kernel)` to extents `(x=out-channel, y=kernel, z=in-channel)`. In our case this yields a close-to-optimal layout, as sketched below.
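Reusing the same assumptions as above (fp16 texels, 4096-byte planes), and additionally assuming that out-channel would be the dimension packed into texel lanes, the proposed mapping collapses the weight into a single, mostly-full plane:

```python
import math

PLANE_BYTES = 4096   # assumed per-plane granularity, as on the test device
TEXEL_BYTES = 4 * 2  # assumed RGBA16F texel

# Proposed mapping: (x=out-channel, y=kernel, z=in-channel), out-channels packed
# -> extents (ceil(256/4), 7, 1) = (64, 7, 1)
used = 64 * 7 * TEXEL_BYTES                              # 3584 bytes of real data
alloc = 1 * math.ceil(used / PLANE_BYTES) * PLANE_BYTES  # one 4096-byte plane
print(alloc, f"{used / alloc:.0%} utilized")             # 4096 88% utilized
```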
Reviewed By: nathanaelsee, jorgep31415
Differential Revision: D66417572