
[ET-VK] Simplify conv2d weight prepacking (>2x pipeline-creation speedup) #3368


Closed · wants to merge 2 commits

Conversation

@jorgep31415 (Contributor) commented Apr 26, 2024

Stack from ghstack (oldest at bottom):

@SSJia previously wrote two implementations of convolution weight prepacking for CPU (before and after PyTorch PR #84973). I originally translated the second implementation to GPU since it is more readable. Now I have translated the first implementation to GPU and switched to it, since it requires fewer steps.
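
For readers unfamiliar with prepacking: conv2d weights arrive in PyTorch's OIHW layout, and the prepack shaders reorder them into a channel-packed layout the Vulkan compute shaders can read as 4-wide texels. The NumPy sketch below illustrates the general idea only; the exact blocking and interleaving ET-VK uses differs per shader (regular, depthwise, transposed), and the `prepack_oihw_to_texels` helper and its layout are hypothetical.

```python
import numpy as np

def prepack_oihw_to_texels(weight: np.ndarray) -> np.ndarray:
    """Illustrative conv2d weight prepack (hypothetical layout, not the
    exact ET-VK one): pad the output-channel dim up to a multiple of 4
    so each (h, w) tap can be fetched as one aligned 4-wide texel."""
    O, I, H, W = weight.shape
    O4 = -(-O // 4) * 4                      # round O up to a multiple of 4
    padded = np.zeros((O4, I, H, W), dtype=weight.dtype)
    padded[:O] = weight
    # Group output channels into blocks of 4 (one texel per block) and
    # move the block lane to the innermost axis, so each texel's four
    # components are contiguous in memory.
    return padded.reshape(O4 // 4, 4, I, H, W).transpose(0, 2, 3, 4, 1)

w = np.random.rand(6, 3, 3, 3).astype(np.float32)  # O=6, I=3, 3x3 kernel
print(prepack_oihw_to_texels(w).shape)             # (2, 3, 3, 3, 4)
```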

The second implementation was so complex that, during model load, pipeline creation took >1500ms. In the test plan's Before, the example sums to 1905ms:

```
[334ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[110ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[270ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[609ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[488ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```

The first implementation now takes <700ms to create pipelines. In the test plan's After, the example sums to 598ms:

```
[135ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[83ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[102ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[69ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[115ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```
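
Summing the two logs confirms the totals quoted above and shows the end-to-end pipeline-creation speedup is about 3.2x, so the >2x in the title is conservative:

```python
before = [334, 110, 270, 94, 609, 488]     # ms, from the Before log above
after = [135, 83, 102, 69, 115, 94]        # ms, from the After log above
print(sum(before), sum(after))             # 1905 598
print(round(sum(before) / sum(after), 2))  # 3.19
```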

Differential Revision: D56617129

pytorch-bot (bot) commented Apr 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/3368

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 999b6fd with merge base 44d4bac:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Apr 26, 2024
@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D56617129

jorgep31415 added a commit that referenced this pull request Apr 26, 2024
@jorgep31415 changed the title from "[ML-GPU] Simplify conv2d weight prepacking (>2x pipeline-creation speedup)" to "Simplify conv2d weight prepacking (>2x pipeline-creation speedup)" Apr 26, 2024
@jorgep31415 changed the title from "Simplify conv2d weight prepacking (>2x pipeline-creation speedup)" to "[ET-VK] Simplify conv2d weight prepacking (>2x pipeline-creation speedup)" Apr 26, 2024
@SS-JIA self-requested a review April 26, 2024 16:35
@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D56617129

jorgep31415 added a commit that referenced this pull request Apr 26, 2024
[ET-VK] Simplify conv2d weight prepacking (>2x pipeline-creation speedup)

Pull Request resolved: #3368

@SSJia previously wrote two implementations of convolution weight prepacking for CPU (before and after [PyTorch PR #84973](pytorch/pytorch#84973)). I originally translated the second implementation to GPU since it is more readable. Now I have translated the first implementation to GPU and switched to it, since it requires fewer steps.

Internal:

This diff targets the Next Steps involving convolution from [ET-VK Model-Load Benchmarks](https://docs.google.com/document/d/11JIBPuCI-u6Xe15GKzFC8pQaEW5F3ipBljWm7Nu_1KM/edit#heading=h.hlhgkp1f0o05) to reduce model-load time.

## Before

The second implementation was so complex that, during model load, pipeline creation took >1500ms. In the test plan's Before, the example sums to 1905ms:
```
[334ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[110ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[270ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[609ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[488ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```

Shader file sizes were 3-5 KB. Taken alone this is likely misleading as a proxy for SPIR-V size, since the .glsl sources include very long comments (see the comment-stripping sketch after the listing).
```
$ ls -l conv*_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 3579 Apr 26 12:59 conv2d_dw_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 4661 Apr 26 12:59 conv2d_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 3960 Apr 26 12:59 conv_transpose2d_prepack_weights_float.glsl
```
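
To get a less comment-biased comparison, one can measure the sources with comments stripped first. The sketch below is my rough illustration, not part of this diff; the regexes handle only plain // and /* */ comments, which is enough for a ballpark number:

```python
import re
from pathlib import Path

def stripped_size(path: Path) -> int:
    """Byte count of a GLSL source with comments removed (rough:
    ignores comment markers inside string literals)."""
    src = path.read_text()
    src = re.sub(r"/\*.*?\*/", "", src, flags=re.S)  # /* block */ comments
    src = re.sub(r"//[^\n]*", "", src)               # // line comments
    return len(src.encode())

for f in sorted(Path(".").glob("conv*_prepack_weights_float.glsl")):
    print(f.name, f.stat().st_size, "->", stripped_size(f))
```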

## After

The first implementation now takes <700ms to create pipelines. In the test plan's After, the example sums to 598ms:
```
[135ms] P::encode-conv2d_prepack_weights_float, (16, 4, 1)
[83ms] P::encode-conv2d_dw_prepack_weights_float, (16, 4, 1)
[102ms] P::encode-conv2d_prepack_weights_float, (8, 8, 1)
[69ms] P::encode-conv2d_dw_prepack_weights_float, (8, 8, 1)
[115ms] P::encode-conv_transpose2d_prepack_weights_float, (8, 8, 1)
[94ms] P::encode-conv_transpose2d_prepack_weights_float, (16, 4, 1)
```

Shader file sizes are ~2.5 KB each.
```
$ ls -l conv*_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 2443 Apr 26 12:53 conv2d_dw_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 2621 Apr 26 12:53 conv2d_prepack_weights_float.glsl
-rw-r--r-- 1 jorgep31415 users 2522 Apr 26 12:53 conv_transpose2d_prepack_weights_float.glsl
```

Differential Revision: [D56617129](https://our.internmc.facebook.com/intern/diff/D56617129/)
ghstack-source-id: 224133695
@facebook-github-bot (Contributor) commented:

This pull request has been merged in 6c06f26.

jorgep31415 added a commit that referenced this pull request Apr 30, 2024
TIL smoke tests are not part of the CI.

Forgot to update this in #3368

Differential Revision: [D56739385](https://our.internmc.facebook.com/intern/diff/D56739385/)

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Apr 30, 2024
Summary:
Pull Request resolved: #3415

TIL smoke tests are not part of the CI.

Forgot to update this in #3368

Reviewed By: yipjustin

Differential Revision: D56739385

fbshipit-source-id: af45047d59ce1da873bf8d7bd8e68d5b07a31184
Labels: CLA Signed, fb-exported, Merged
3 participants