Commit 999b6fd

Update on "[ET-VK] Simplify conv2d weight prepacking (>2x pipeline-creation speedup)"
ssjia has previously written two implementations of convolution weight prepacking for CPU (before and after [PyTorch PR #84973](pytorch/pytorch#84973)). Originally, I translated the second implementation to GPU since it is more readable. Now, I have translated the first implementation to GPU and switched to it, since it requires fewer steps. The second implementation was so complex that during model load it took >1500ms to create pipelines. In the test plan's Before, the example sums to 1905ms:

```
[334ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[110ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[270ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[609ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[488ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```

The first implementation now takes <700ms to create pipelines. In the test plan's After, the example sums to 598ms:

```
[135ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[83ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[102ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[69ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[115ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```

Differential Revision: [D56617129](https://our.internmc.facebook.com/intern/diff/D56617129/)

[ghstack-poisoned]
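A quick arithmetic check (plain Python, just summing the bracketed per-shader timings quoted above) confirms the stated Before/After totals:

```python
# Per-shader pipeline-creation timings (ms) quoted in the commit message.
before_ms = [334, 110, 270, 94, 609, 488]  # Before: conv2d, dw, and transpose variants
after_ms = [135, 83, 102, 69, 115, 94]     # After

print(sum(before_ms), sum(after_ms))  # 1905 598
```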
2 parents 5d00a17 + 613871f commit 999b6fd

File tree

5 files changed: +10 / -11 lines


backends/vulkan/runtime/graph/ops/glsl/conv2d_dw_prepack_weights.glsl (2 additions, 2 deletions)

```diff
@@ -70,7 +70,7 @@ void main() {
   const ivec4 w = p0 % W;
 
   // Map modified tensor_idx to modifed buffer_i
-  // Zero modified tensor idx that are out of bounds
+  // Zero out if modified tensor idx is out of bounds
   const ivec4 buf_i = n * C*H*W + h * W + w;
   const bvec4 mask = bvec4(lessThan(n, ivec4(N)));
 
@@ -84,7 +84,7 @@ void main() {
   if (mask.z) {
     texel.z = SCALAR_T(buffer_in[buf_i.z]);
   }
-  if (mask.w ) {
+  if (mask.w) {
     texel.w = SCALAR_T(buffer_in[buf_i.w]);
   }
```

backends/vulkan/runtime/graph/ops/glsl/conv2d_prepack_weights.glsl (2 additions, 2 deletions)

```diff
@@ -74,7 +74,7 @@ void main() {
   const ivec4 w = p1 % W;
 
   // Map modified tensor_idx to modified buffer_i
-  // Zero modified tensor idx that are out of bounds
+  // Zero out if modified tensor idx is out of bounds
   const ivec4 buf_i = n * C*H*W + c * H*W + h * W + w;
   const bvec4 mask = bvec4(ivec4(lessThan(n, ivec4(N))) & ivec4(lessThan(c, ivec4(C))));
 
@@ -88,7 +88,7 @@ void main() {
   if (mask.z) {
     texel.z = SCALAR_T(buffer_in[buf_i.z]);
   }
-  if (mask.w ) {
+  if (mask.w) {
     texel.w = SCALAR_T(buffer_in[buf_i.w]);
   }
```
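The bounds-masking pattern shared by these prepacking shaders can be sketched outside GLSL. The following is a minimal Python illustration (not the shader itself; `pack_texel` is a hypothetical name): each texel packs four lanes, and a lane is loaded from the flattened NCHW buffer only when its `n`/`c` indices are in bounds, mirroring the `bvec4 mask` in the diffs above.

```python
# Illustrative sketch of the shader's per-texel masking, assuming flattened
# NCHW input. Lanes whose n/c indices fall outside (N, C) stay zero, mirroring
# `mask = lessThan(n, N) & lessThan(c, C)` in the GLSL above.
def pack_texel(buffer_in, n, c, h, w, N, C, H, W):
    texel = [0.0, 0.0, 0.0, 0.0]
    for lane in range(4):
        if n[lane] < N and c[lane] < C:  # the bvec4 mask
            # Same index math as `buf_i = n * C*H*W + c * H*W + h * W + w`
            buf_i = n[lane] * C * H * W + c[lane] * H * W + h[lane] * W + w[lane]
            texel[lane] = buffer_in[buf_i]
    return texel
```

For example, packing channels 0..3 of a tensor with only C=2 channels fills the first two lanes from the buffer and leaves the last two zeroed.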

backends/vulkan/runtime/graph/ops/glsl/conv_transpose2d_prepack_weights.glsl (2 additions, 2 deletions)

```diff
@@ -73,7 +73,7 @@ void main() {
   const ivec4 w = W-1 - p1 % W;
 
   // Map modified tensor_idx to modifed buffer_i
-  // Zero modified tensor idx that are out of bounds
+  // Zero out if modified tensor idx is out of bounds
   const ivec4 buf_i = n * C*H*W + c * H*W + h * W + w;
   const bvec4 mask = bvec4(ivec4(lessThan(n, ivec4(N))) & ivec4(lessThan(c, ivec4(C))));
 
@@ -87,7 +87,7 @@ void main() {
   if (mask.z) {
     texel.z = SCALAR_T(buffer_in[buf_i.z]);
   }
-  if (mask.w ) {
+  if (mask.w) {
     texel.w = SCALAR_T(buffer_in[buf_i.w]);
   }
```

examples/cadence/ops/functions.yaml (1 addition, 1 deletion)

```diff
@@ -60,7 +60,7 @@
     - arg_meta: null
       kernel_name: impl::HiFi::quantized_layer_norm_out
 
-- func: cadence::quantized_linear.out(Tensor src, Tensor weight, Tensor bias, float src_scale, int src_zero_point, float weight_scale, int weight_zero_point, Tensor out_multiplier, Tensor out_shift, int out_zero_point, *, Tensor(a!) out) -> Tensor(a!)
+- func: cadence::quantized_linear.out(Tensor src, Tensor weight, Tensor bias, int src_zero_point, Tensor weight_zero_point, Tensor out_multiplier, Tensor out_shift, int out_zero_point, Tensor? offset, *, Tensor(a!) out) -> Tensor(a!)
   kernels:
     - arg_meta: null
      kernel_name: impl::HiFi::quantized_linear_out
```

examples/cadence/ops/quantized_linear_out.cpp (3 additions, 4 deletions)

```diff
@@ -24,13 +24,12 @@ void quantized_linear_out(
     const Tensor& src,
     const Tensor& weight,
     const Tensor& bias,
-    double src_scale,
     int64_t src_zero_point,
-    double weight_scale,
-    int64_t weight_zero_point,
+    const Tensor& weight_zero_point,
     const Tensor& out_multiplier,
     const Tensor& out_shift,
     int64_t out_zero_point,
+    const exec_aten::optional<Tensor>& offset,
     Tensor& out) {
   // input comes in shape [leading_dims, in_dim]
   // weight comes in shape [out_dim, in_dim]
@@ -58,7 +57,7 @@ void quantized_linear_out(
       in_dim, // vec_offset of p_mat2.
       out_dim, // out_offset, i.e., offset of next output element written
       1, // out_stride, i.e., stride to go to next output row
-      -weight_zero_point, // mat1_zero_bias
+      -weight_zero_point.const_data_ptr<int32_t>()[0], // mat1_zero_bias
      -src_zero_point, // mat2_zero_bias
      out_multiplier.const_data_ptr<int32_t>(), // out_multiplier
      out_shift.const_data_ptr<int32_t>(), // out_shift
```
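The net effect of this signature change is that `weight_zero_point` arrives as a tensor rather than a bare scalar, and the kernel reads its first element when forming the zero bias. A minimal Python sketch of that behavior (the function name `mat1_zero_bias` is hypothetical; a plain list stands in for the tensor):

```python
# Sketch of the new zero-bias computation: read element 0 of the
# weight_zero_point tensor and negate it, mirroring
# `-weight_zero_point.const_data_ptr<int32_t>()[0]` in the C++ diff above.
def mat1_zero_bias(weight_zero_point):
    return -weight_zero_point[0]
```

Passing the zero point as a tensor matches the updated `functions.yaml` schema, where `weight_zero_point` changed from `int` to `Tensor`.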
