# [ET-VK] Simplify conv2d weight prepacking (>2x pipeline-creation speedup) #3368
@SSJia has previously written two implementations of convolution weight prepacking for CPU (before and after [PyTorch PR #84973](pytorch/pytorch#84973)). Originally, I translated the second implementation to GPU since it is more readable. Now I have translated the first implementation to GPU and switched to it, since it requires fewer steps.

The second impl was so complex that during model load, it took >1500ms to create pipelines. In the test plan's Before, the example sums to 1905ms:

```
[334ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[110ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[270ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[609ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[488ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```

The first impl now takes <700ms to create pipelines. In the test plan's After, the example sums to 598ms:

```
[135ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[83ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[102ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[69ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[115ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```

Internal: This diff targets the Next Steps involving convolution from [ET-VK Model-Load Benchmarks](https://docs.google.com/document/d/11JIBPuCI-u6Xe15GKzFC8pQaEW5F3ipBljWm7Nu_1KM/edit#heading=h.hlhgkp1f0o05) to reduce model-load time.

Differential Revision: [D56617129](https://our.internmc.facebook.com/intern/diff/D56617129/)
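For readers unfamiliar with weight prepacking: GPU kernels typically read conv weights through 4-channel (RGBA-style) texels, so the OIHW weight tensor is rearranged at model load into a layout where four output channels sit contiguously. The sketch below illustrates that general idea in NumPy; the function name and the exact axis order are hypothetical and are not ET-VK's actual memory layout.

```python
import numpy as np

def pack_weights_channels4(w_oihw):
    """Pack OIHW conv weights so groups of 4 output channels sit
    contiguously, mimicking a 4-channel (RGBA-texel) GPU layout.
    Illustrative only -- not the actual ET-VK layout."""
    O, I, H, W = w_oihw.shape
    O4 = (O + 3) // 4 * 4                      # pad O up to a multiple of 4
    padded = np.zeros((O4, I, H, W), dtype=w_oihw.dtype)
    padded[:O] = w_oihw
    # (O4, I, H, W) -> (O4/4, I, H, W, 4): last axis holds one texel's
    # 4 consecutive output channels for the same (in_c, h, w) position.
    return padded.reshape(O4 // 4, 4, I, H, W).transpose(0, 2, 3, 4, 1)

# 6 output channels get padded to 8, i.e. 2 texel groups.
w = np.arange(6 * 2 * 3 * 3, dtype=np.float32).reshape(6, 2, 3, 3)
packed = pack_weights_channels4(w)
print(packed.shape)  # (2, 2, 3, 3, 4)
```

The point of the PR is not the layout itself but how many shader steps it takes to produce it on the GPU; fewer rearrangement steps means simpler shaders and faster pipeline creation.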
✅ No failures as of commit 999b6fd with merge base 44d4bac (reported by Dr. CI).
This pull request was exported from Phabricator. Differential Revision: D56617129
Pull Request resolved: #3368

@SSJia has previously written two implementations of convolution weight prepacking for CPU (before and after [PyTorch PR #84973](pytorch/pytorch#84973)). Originally, I translated the second implementation to GPU since it is more readable. Now I have translated the first implementation to GPU and switched to it, since it requires fewer steps.

Internal: This diff targets the Next Steps involving convolution from [ET-VK Model-Load Benchmarks](https://docs.google.com/document/d/11JIBPuCI-u6Xe15GKzFC8pQaEW5F3ipBljWm7Nu_1KM/edit#heading=h.hlhgkp1f0o05) to reduce model-load time.

## Before

The second impl was so complex that during model load, it took >1500ms to create pipelines. In the test plan's Before, the example sums to 1905ms:

```
[334ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[110ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[270ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[609ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[488ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```

Shader file sizes were 3-5 KB each, though this is a misleading proxy for the compiled SPIR-V size since the GLSL includes very long comments.

```
[[email protected] ~/scratch/shaders]$ ls -l conv*_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 3579 Apr 26 12:59 conv2d_dw_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 4661 Apr 26 12:59 conv2d_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 3960 Apr 26 12:59 conv_transpose2d_prepack_weights_float.glsl
```

## After

The first impl now takes <700ms to create pipelines. In the test plan's After, the example sums to 598ms:

```
[135ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[83ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[102ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[69ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[115ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```

Shader file sizes are ~2.5 KB each.

```
[[email protected] ~/scratch/shaders]$ ls -l conv*_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 2443 Apr 26 12:53 conv2d_dw_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 2621 Apr 26 12:53 conv2d_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 2522 Apr 26 12:53 conv_transpose2d_prepack_weights_float.glsl
```

Differential Revision: [D56617129](https://our.internmc.facebook.com/intern/diff/D56617129/)

ghstack-source-id: 224133695
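As a quick sanity check on the quoted totals, the bracketed millisecond figures from the two profiles can be parsed and summed; the snippet below is just arithmetic over the logs shown above, not part of the PR.

```python
import re

before_log = """\
[334ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[110ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[270ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[609ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[488ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)"""

after_log = """\
[135ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[83ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[102ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[69ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[115ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)"""

def total_ms(log):
    # Sum the bracketed millisecond figures, e.g. "[334ms]" -> 334
    return sum(int(m) for m in re.findall(r"\[(\d+)ms\]", log))

print(total_ms(before_log), total_ms(after_log))  # 1905 598
```

On this example the speedup is 1905/598, roughly 3.2x, comfortably above the >2x claimed in the title.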
This pull request has been merged in 6c06f26.
TIL smoke tests are not part of the CI. Forgot to update this in #3368.

Differential Revision: [D56739385](https://our.internmc.facebook.com/intern/diff/D56739385/)

ghstack-source-id: 224416140
Pull Request resolved: #3415