ssjia has previously written two implementations of convolution weight prepacking for CPU (before and after [PyTorch PR #84973](pytorch/pytorch#84973)). Originally, I translated the second implementation to GPU since it is more readable. This diff translates the first implementation to GPU and switches to it, since it requires fewer steps.
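For context, prepacking here means rearranging the weight tensor into a GPU-friendly layout ahead of time. Below is a minimal host-side sketch of the general idea, assuming an OIHW source layout packed into 4-channel texel blocks; the function name, destination layout, and padding scheme are illustrative assumptions, not the actual ET-VK shader logic.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of conv2d weight prepacking: rearrange an OIHW float
// tensor so that groups of 4 output channels are contiguous, matching the
// RGBA texel layout a GPU compute shader would read. All names and layout
// choices here are assumptions for illustration only.
std::vector<float> prepack_conv2d_weights(
    const std::vector<float>& src, // OIHW, size OC*IC*KH*KW
    size_t OC, size_t IC, size_t KH, size_t KW) {
  const size_t OC4 = (OC + 3) / 4; // output channels padded to a multiple of 4
  std::vector<float> dst(OC4 * 4 * IC * KH * KW, 0.0f);
  for (size_t oc = 0; oc < OC; ++oc) {
    for (size_t ic = 0; ic < IC; ++ic) {
      for (size_t kh = 0; kh < KH; ++kh) {
        for (size_t kw = 0; kw < KW; ++kw) {
          const size_t src_idx = ((oc * IC + ic) * KH + kh) * KW + kw;
          // Destination layout [OC/4][IC][KH][KW][4]: each texel holds the
          // same (ic, kh, kw) element for 4 consecutive output channels.
          const size_t dst_idx =
              ((((oc / 4) * IC + ic) * KH + kh) * KW + kw) * 4 + (oc % 4);
          dst[dst_idx] = src[src_idx];
        }
      }
    }
  }
  return dst;
}
```

In the GPU versions compared here, this rearrangement runs as a compute shader at model load, so the complexity of the shader directly affects how long its pipeline takes to create.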
The second implementation was complex enough that creating its pipelines during model load took >1500ms. In the test plan's Before, the per-shader timings sum to 1905ms:
```
[334ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[110ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[270ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[609ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[488ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```
With the first implementation, pipeline creation now takes <700ms. In the test plan's After, the per-shader timings sum to 598ms:
```
[135ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[83ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[102ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[69ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[115ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```
Internal:
This diff addresses the convolution-related Next Steps from [ET-VK Model-Load Benchmarks](https://docs.google.com/document/d/11JIBPuCI-u6Xe15GKzFC8pQaEW5F3ipBljWm7Nu_1KM/edit#heading=h.hlhgkp1f0o05) to reduce model-load time.
Differential Revision: [D56617129](https://our.internmc.facebook.com/intern/diff/D56617129/)
ghstack-source-id: 224037564
Pull Request resolved: #3368