# [ET-VK] Simplify conv2d weight prepacking (>2x pipeline-creation speedup) #3368
@SSJia has previously written two implementations of convolution weight prepacking for CPU (before and after [PyTorch PR #84973](pytorch/pytorch#84973)). Originally, I translated the second implementation to GPU since it is more readable. Now I have translated the first implementation to GPU and switched to it, since it requires fewer steps.

The second impl was so complex that during model load, it took >1500ms to create pipelines. In the test plan's Before, the example sums to 1905ms:

```
[334ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[110ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[270ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[609ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[488ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```

The first impl now takes <700ms to create pipelines. In the test plan's After, the example sums to 598ms:

```
[135ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[83ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[102ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[69ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[115ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```

Internal: This diff targets the Next Steps involving convolution from [ET-VK Model-Load Benchmarks](https://docs.google.com/document/d/11JIBPuCI-u6Xe15GKzFC8pQaEW5F3ipBljWm7Nu_1KM/edit#heading=h.hlhgkp1f0o05) to reduce model-load time.

Differential Revision: [D56617129](https://our.internmc.facebook.com/intern/diff/D56617129/)
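For readers unfamiliar with weight prepacking: GPU kernels typically read conv weights through 4-channel (RGBA-style) texels, so the OIHW weight tensor is rearranged at model load into a layout where four output channels sit contiguously. The sketch below illustrates that general idea in NumPy; the function name and the exact axis order are hypothetical and are not ET-VK's actual memory layout.

```python
import numpy as np

def pack_weights_channels4(w_oihw):
    """Pack OIHW conv weights so groups of 4 output channels sit
    contiguously, mimicking a 4-channel (RGBA-texel) GPU layout.
    Illustrative only -- not the actual ET-VK layout."""
    O, I, H, W = w_oihw.shape
    O4 = (O + 3) // 4 * 4                      # pad O up to a multiple of 4
    padded = np.zeros((O4, I, H, W), dtype=w_oihw.dtype)
    padded[:O] = w_oihw
    # (O4, I, H, W) -> (O4/4, I, H, W, 4): last axis holds one texel's
    # 4 consecutive output channels for the same (in_c, h, w) position.
    return padded.reshape(O4 // 4, 4, I, H, W).transpose(0, 2, 3, 4, 1)

# 6 output channels get padded to 8, i.e. 2 texel groups.
w = np.arange(6 * 2 * 3 * 3, dtype=np.float32).reshape(6, 2, 3, 3)
packed = pack_weights_channels4(w)
print(packed.shape)  # (2, 2, 3, 3, 4)
```

The point of the PR is not the layout itself but how many shader steps it takes to produce it on the GPU; fewer rearrangement steps means simpler shaders and faster pipeline creation.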
✅ No failures as of commit 999b6fd with merge base 44d4bac (reported by Dr. CI).
This pull request was exported from Phabricator. Differential Revision: D56617129
Pull Request resolved: #3368

@SSJia has previously written two implementations of convolution weight prepacking for CPU (before and after [PyTorch PR #84973](pytorch/pytorch#84973)). Originally, I translated the second implementation to GPU since it is more readable. Now I have translated the first implementation to GPU and switched to it, since it requires fewer steps.

Internal: This diff targets the Next Steps involving convolution from [ET-VK Model-Load Benchmarks](https://docs.google.com/document/d/11JIBPuCI-u6Xe15GKzFC8pQaEW5F3ipBljWm7Nu_1KM/edit#heading=h.hlhgkp1f0o05) to reduce model-load time.

## Before

The second impl was so complex that during model load, it took >1500ms to create pipelines. In the test plan's Before, the example sums to 1905ms:

```
[334ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[110ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[270ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[609ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[488ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```

Shader file sizes were 3-5 KB each, though this is a misleading proxy for the compiled SPIR-V size since the GLSL includes very long comments.

```
[[email protected] ~/scratch/shaders]$ ls -l conv*_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 3579 Apr 26 12:59 conv2d_dw_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 4661 Apr 26 12:59 conv2d_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 3960 Apr 26 12:59 conv_transpose2d_prepack_weights_float.glsl
```

## After

The first impl now takes <700ms to create pipelines. In the test plan's After, the example sums to 598ms:

```
[135ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[83ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[102ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[69ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[115ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```

Shader file sizes are ~2.5 KB each.

```
[[email protected] ~/scratch/shaders]$ ls -l conv*_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 2443 Apr 26 12:53 conv2d_dw_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 2621 Apr 26 12:53 conv2d_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 2522 Apr 26 12:53 conv_transpose2d_prepack_weights_float.glsl
```

Differential Revision: [D56617129](https://our.internmc.facebook.com/intern/diff/D56617129/)

ghstack-source-id: 224133695
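As a quick sanity check on the quoted totals, the bracketed millisecond figures from the two profiles can be parsed and summed; the snippet below is just arithmetic over the logs shown above, not part of the PR.

```python
import re

before_log = """\
[334ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[110ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[270ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[609ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[488ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)"""

after_log = """\
[135ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[83ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[102ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[69ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[115ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)"""

def total_ms(log):
    # Sum the bracketed millisecond figures, e.g. "[334ms]" -> 334
    return sum(int(m) for m in re.findall(r"\[(\d+)ms\]", log))

print(total_ms(before_log), total_ms(after_log))  # 1905 598
```

On this example the speedup is 1905/598, roughly 3.2x, comfortably above the >2x claimed in the title.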
This pull request has been merged in 6c06f26.
TIL smoke tests are not part of the CI. Forgot to update this in #3368.

Differential Revision: [D56739385](https://our.internmc.facebook.com/intern/diff/D56739385/)

ghstack-source-id: 224416140
Pull Request resolved: #3415