# [ET-VK][EZ] Return if unary_op is out of bounds #2520
Conversation
Eliminate additional computation when we go out of bounds, which can occur when

1. the global work group size is not a perfect multiple of the local work group size, and/or
2. dynamic shapes are used to reshape the logical tensor data.

## Aside on sizes and extents

This grew out of my struggles with the `max_pool2d` implementation, which led me to look more closely at how `gpu_sizes_ubo()` and `extents_ubo()` differ. I'm writing this summary to explain it to future me when I inevitably forget. This knowledge can be obtained by studying [`Tensor.*`](https://github.com/pytorch/pytorch/blob/cceabe873f11c6611f627a3bb0055994952ec6b8/aten/src/ATen/native/vulkan/api/Tensor.cpp).

If our tensor has

- `cpu_sizes`: (N, C, H, W)

then

- `gpu_sizes`: (W, H, C, N), but the packed-dim size is aligned up to a multiple of 4.
- `extents`: the size of the actual image texture, i.e., `gpu_sizes` but merging C and N and dividing the packed-dim size by 4. So we obtain:
  1. WIDTH_PACKED => `extents`: (W / 4, H, C*N)
  2. HEIGHT_PACKED => `extents`: (W, H / 4, C*N)
  3. CHANNELS_PACKED => `extents`: (W, H, C*N / 4)

Hence,

- for texture positions, use `extents`,
- for logical coordinates, use `gpu_sizes`.

Differential Revision: [D55097275](https://our.internmc.facebook.com/intern/diff/D55097275/)
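For concreteness, here is a minimal C++ analogue of the early return this diff adds. The real check lives in the GLSL compute shader; the function name and signature below are illustrative only, not the actual shader code.

```cpp
#include <array>
#include <cstdint>

// Hypothetical host-side analogue of the shader's early return. An invocation
// whose global id lies outside the texture extents has no texel to write, so
// it should bail out before doing any work. This can happen when the global
// work group size is rounded up past the extents, or when dynamic shapes
// shrink the logical tensor.
bool out_of_bounds(
    const std::array<int32_t, 3>& pos,       // global invocation id (x, y, z)
    const std::array<int32_t, 3>& extents) { // texture extents from extents_ubo()
  for (size_t i = 0; i < 3; ++i) {
    if (pos[i] >= extents[i]) {
      return true;
    }
  }
  return false;
}
```

In GLSL terms this corresponds to a guard along the lines of `if (any(greaterThanEqual(pos, extents))) return;` at the top of the shader's `main()`.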
This pull request was exported from Phabricator. Differential Revision: D55097275
This pull request has been merged in 0f0c307.
Summary: Pull Request resolved: #2653

As copyrightly pointed out, broadcasting was not working properly for the example below. I root-caused this to confusion between `sizes()` and `gpu_sizes()` once again! These concepts are explained in #2520.

We should use the CPU size, not the GPU size, to detect when we should broadcast across the packed-dim texel's elements.

# Example

Given inputs `torch.ones(2, 3)` and `torch.ones(2, 1)` and `GPUMemoryLayout::WIDTH_PACKED`, we have CPU widths 3 and 1, respectively. These are aligned up to GPU widths 4 and 4, and hence we were failing to broadcast along the packed-dim texel's elements.

## torch.ones(2, 3)

```
(2, 3) = (H, W) = sizes
[[1 1 1]
 [1 1 1]]

-> (W, H) = (3, 2) → (4, 2) = gpu_sizes -> extents = (1, 2)
[1 1 1 0]
[1 1 1 0]
```

## torch.ones(2, 1)

```
(2, 1) = (H, W) = sizes
[[1]
 [1]]

-> (W, H) = (1, 2) → (4, 2) = gpu_sizes -> extents = (1, 2)
[1 0 0 0]
[1 0 0 0]

-> (broadcast from this change)
[1 1 1 1]
[1 1 1 1]
```

## torch.ones(2, 3) + torch.ones(2, 1)

Ignore the final element of each texel as it's just padding we never read.

```
No broadcast:
[1 1 1 0]   [1 0 0 0]   [2 1 1 0]
[1 1 1 0] + [1 0 0 0] = [2 1 1 0]

Broadcast:
[1 1 1 0]   [1 1 1 1]   [2 2 2 1]
[1 1 1 0] + [1 1 1 1] = [2 2 2 1]
```

# Cleanup

Remove the unneeded `check_broadcastable()` since this case is caught earlier in the PyTorch compiler pipeline. For example, `torch.ones(2, 3) + torch.ones(2, 2)` triggers this error:

```
TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., size=(2, 3)), FakeTensor(..., size=(2, 2))), **{}):
Attempting to broadcast a dimension of length 2 at -1! Mismatching argument at index 1 had torch.Size([2, 2]); but expected shape should be broadcastable to [2, 3]
```

bypass-github-export-checks

Reviewed By: SS-JIA

Differential Revision: D55278527

fbshipit-source-id: abb8a83924370b21dbbabdd5f1f4af8f502edc1f
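A minimal C++ sketch of the fix's core idea follows. The function name and parameters are hypothetical rather than the actual ExecuTorch helpers; it only illustrates why the decision must be made on CPU sizes instead of the 4-aligned `gpu_sizes`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: decide whether `other` needs to be broadcast across
// the packed-dim texel elements by comparing CPU sizes. Comparing gpu_sizes
// instead would be wrong: with WIDTH_PACKED, CPU widths 3 and 1 both align
// up to a GPU width of 4, so the mismatch would go undetected.
bool needs_packed_dim_broadcast(
    const std::vector<int64_t>& self_cpu_sizes,   // e.g. {2, 3}
    const std::vector<int64_t>& other_cpu_sizes,  // e.g. {2, 1}
    size_t packed_dim_from_end) {                 // 0 == innermost dim (W) for WIDTH_PACKED
  auto dim_or_1 = [](const std::vector<int64_t>& sizes, size_t from_end) -> int64_t {
    // Missing leading dims broadcast as size 1.
    return from_end < sizes.size() ? sizes[sizes.size() - 1 - from_end] : 1;
  };
  const int64_t self_dim = dim_or_1(self_cpu_sizes, packed_dim_from_end);
  const int64_t other_dim = dim_or_1(other_cpu_sizes, packed_dim_from_end);
  return other_dim == 1 && self_dim > 1;
}
```

For the example above, `needs_packed_dim_broadcast({2, 3}, {2, 1}, 0)` returns true, so the second operand's single value is replicated across the texel's four elements before the add.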
Stack from ghstack (oldest at bottom):

Eliminate additional computation when we go out of bounds, which can occur when

1. the global work group size is not a perfect multiple of the local work group size, and/or
2. dynamic shapes are used to reshape the logical tensor data.

## Aside on sizes and extents

This grew out of my struggles with the `max_pool2d` implementation, which led me to look more closely at how `gpu_sizes_ubo()` and `extents_ubo()` differ. I'm writing this summary to explain it to future me when I inevitably forget. This knowledge can be obtained by studying [`Tensor.*`](https://github.com/pytorch/pytorch/blob/cceabe873f11c6611f627a3bb0055994952ec6b8/aten/src/ATen/native/vulkan/api/Tensor.cpp).

If our tensor has

- `cpu_sizes`: (N, C, H, W)

then

- `gpu_sizes`: (W, H, C, N), but the packed-dim size is aligned up to a multiple of 4.
- `extents`: the size of the actual image texture, i.e., `gpu_sizes` but merging C and N and dividing the packed-dim size by 4.

So we obtain:

1. WIDTH_PACKED => `extents`: (W / 4, H, C*N)
2. HEIGHT_PACKED => `extents`: (W, H / 4, C*N)
3. CHANNELS_PACKED => `extents`: (W, H, C*N / 4)

Hence,

- for texture positions, use `extents`,
- for logical coordinates, use `gpu_sizes`.

Differential Revision: [D55097275](https://our.internmc.facebook.com/intern/diff/D55097275/)
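To make the mapping concrete, here is a small C++ sketch of the relationship described above, assuming a 4-D NCHW tensor. The type and function names are illustrative only; they are not the actual API from `Tensor.cpp`.

```cpp
#include <array>
#include <cstdint>

// Illustrative sketch of cpu_sizes -> gpu_sizes -> extents.
enum class PackedDim { WIDTH, HEIGHT, CHANNELS };

constexpr int64_t align_up_4(int64_t v) { return (v + 3) / 4 * 4; }

struct TextureLayout {
  std::array<int64_t, 4> gpu_sizes; // (W, H, C, N), packed dim padded to 4
  std::array<int64_t, 3> extents;   // size of the backing image texture
};

TextureLayout derive_layout(const std::array<int64_t, 4>& cpu_sizes, PackedDim packed) {
  const int64_t N = cpu_sizes[0], C = cpu_sizes[1], H = cpu_sizes[2], W = cpu_sizes[3];

  // gpu_sizes reverses NCHW to WHCN and aligns the packed dim up to a multiple of 4.
  TextureLayout out{{W, H, C, N}, {}};
  switch (packed) {
    case PackedDim::WIDTH:    out.gpu_sizes[0] = align_up_4(W); break;
    case PackedDim::HEIGHT:   out.gpu_sizes[1] = align_up_4(H); break;
    case PackedDim::CHANNELS: out.gpu_sizes[2] = align_up_4(C); break;
  }

  // extents merge C and N into the depth dimension and divide the packed dim
  // by 4, since each texel holds four values along that dimension.
  const auto& g = out.gpu_sizes;
  switch (packed) {
    case PackedDim::WIDTH:    out.extents = {g[0] / 4, g[1], g[2] * g[3]}; break;
    case PackedDim::HEIGHT:   out.extents = {g[0], g[1] / 4, g[2] * g[3]}; break;
    case PackedDim::CHANNELS: out.extents = {g[0], g[1], g[2] * g[3] / 4}; break;
  }
  return out;
}
```

For instance, `derive_layout({1, 1, 2, 3}, PackedDim::WIDTH)` yields gpu_sizes (4, 2, 1, 1) and extents (1, 2, 1), matching the `torch.ones(2, 3)` walkthrough earlier on this page.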