# [ET-VK][EZ] Return if unary_op is out of bounds #2520
Conversation
Eliminate additional computation when we go out of bounds, which can occur when

1. the global work group size is not a perfect multiple of the local work group size, and/or
2. dynamic shapes are used to reshape the logical tensor data.

## Aside on sizes and extents

This grew out of my struggles with the `max_pool2d` implementation, which led me to look more closely at how `gpu_sizes_ubo()` and `extents_ubo()` differ. I'm writing this summary to explain it to future me when I inevitably forget. This knowledge can be obtained by studying [`Tensor.*`](https://github.com/pytorch/pytorch/blob/cceabe873f11c6611f627a3bb0055994952ec6b8/aten/src/ATen/native/vulkan/api/Tensor.cpp).

If our tensor has

- `cpu_sizes`: (N, C, H, W)

then

- `gpu_sizes`: (W, H, C, N), but the packed-dim size is aligned up to a multiple of 4.
- `extents`: the size of the actual image texture, i.e., `gpu_sizes` but merging C and N and dividing the packed-dim size by 4. So we obtain:
  1. WIDTH_PACKED => `extents`: (W / 4, H, C*N)
  2. HEIGHT_PACKED => `extents`: (W, H / 4, C*N)
  3. CHANNELS_PACKED => `extents`: (W, H, C*N / 4)

Hence,

- for texture positions, use `extents`,
- for logical coordinates, use `gpu_sizes`.

Differential Revision: [D55097275](https://our.internmc.facebook.com/intern/diff/D55097275/)
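For concreteness, here is a minimal C++ analogue of the early return this diff adds. The real check lives in the GLSL compute shader; the function name and signature below are illustrative only, not the actual shader code.

```cpp
#include <array>
#include <cstdint>

// Hypothetical host-side analogue of the shader's early return. An invocation
// whose global id lies outside the texture extents has no texel to write, so
// it should bail out before doing any work. This can happen when the global
// work group size is rounded up past the extents, or when dynamic shapes
// shrink the logical tensor.
bool out_of_bounds(
    const std::array<int32_t, 3>& pos,       // global invocation id (x, y, z)
    const std::array<int32_t, 3>& extents) { // texture extents from extents_ubo()
  for (size_t i = 0; i < 3; ++i) {
    if (pos[i] >= extents[i]) {
      return true;
    }
  }
  return false;
}
```

In GLSL terms this corresponds to a guard along the lines of `if (any(greaterThanEqual(pos, extents))) return;` at the top of the shader's `main()`.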
This pull request was exported from Phabricator. Differential Revision: D55097275
This pull request has been merged in 0f0c307.
Summary: Pull Request resolved: #2653

As copyrightly pointed out, broadcasting was not working properly for the example below. I root-caused this to confusion between `sizes()` and `gpu_sizes()` once again! These concepts are explained in #2520.

We should use the CPU size, not the GPU size, to detect when we should broadcast across the packed-dim texel's elements.

# Example

Given inputs `torch.ones(2, 3)` and `torch.ones(2, 1)` and `GPUMemoryLayout::WIDTH_PACKED`, we have CPU widths 3 and 1, respectively. These are aligned up to GPU widths 4 and 4, and hence we were failing to broadcast along the packed-dim texel's elements.

## torch.ones(2, 3)

```
(2, 3) = (H, W) = sizes
[[1 1 1]
 [1 1 1]]

-> (W, H) = (3, 2) → (4, 2) = gpu_sizes -> extents = (1, 2)
[1 1 1 0]
[1 1 1 0]
```

## torch.ones(2, 1)

```
(2, 1) = (H, W) = sizes
[[1]
 [1]]

-> (W, H) = (1, 2) → (4, 2) = gpu_sizes -> extents = (1, 2)
[1 0 0 0]
[1 0 0 0]

-> (broadcast from this change)
[1 1 1 1]
[1 1 1 1]
```

## torch.ones(2, 3) + torch.ones(2, 1)

Ignore the final element of each texel as it's just padding we never read.

```
No broadcast:
[1 1 1 0]   [1 0 0 0]   [2 1 1 0]
[1 1 1 0] + [1 0 0 0] = [2 1 1 0]

Broadcast:
[1 1 1 0]   [1 1 1 1]   [2 2 2 1]
[1 1 1 0] + [1 1 1 1] = [2 2 2 1]
```

# Cleanup

Remove the unneeded `check_broadcastable()` since this case is caught earlier in the PyTorch compiler pipeline. For example, `torch.ones(2, 3) + torch.ones(2, 2)` triggers this error:

```
TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., size=(2, 3)), FakeTensor(..., size=(2, 2))), **{}):
Attempting to broadcast a dimension of length 2 at -1! Mismatching argument at index 1 had torch.Size([2, 2]); but expected shape should be broadcastable to [2, 3]
```

bypass-github-export-checks

Reviewed By: SS-JIA

Differential Revision: D55278527

fbshipit-source-id: abb8a83924370b21dbbabdd5f1f4af8f502edc1f
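A minimal C++ sketch of the fix's core idea follows. The function name and parameters are hypothetical rather than the actual ExecuTorch helpers; it only illustrates why the decision must be made on CPU sizes instead of the 4-aligned `gpu_sizes`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: decide whether `other` needs to be broadcast across
// the packed-dim texel elements by comparing CPU sizes. Comparing gpu_sizes
// instead would be wrong: with WIDTH_PACKED, CPU widths 3 and 1 both align
// up to a GPU width of 4, so the mismatch would go undetected.
bool needs_packed_dim_broadcast(
    const std::vector<int64_t>& self_cpu_sizes,   // e.g. {2, 3}
    const std::vector<int64_t>& other_cpu_sizes,  // e.g. {2, 1}
    size_t packed_dim_from_end) {                 // 0 == innermost dim (W) for WIDTH_PACKED
  auto dim_or_1 = [](const std::vector<int64_t>& sizes, size_t from_end) -> int64_t {
    // Missing leading dims broadcast as size 1.
    return from_end < sizes.size() ? sizes[sizes.size() - 1 - from_end] : 1;
  };
  const int64_t self_dim = dim_or_1(self_cpu_sizes, packed_dim_from_end);
  const int64_t other_dim = dim_or_1(other_cpu_sizes, packed_dim_from_end);
  return other_dim == 1 && self_dim > 1;
}
```

For the example above, `needs_packed_dim_broadcast({2, 3}, {2, 1}, 0)` returns true, so the second operand's single value is replicated across the texel's four elements before the add.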
Stack from ghstack (oldest at bottom):

Eliminate additional computation when we go out of bounds, which can occur when

1. the global work group size is not a perfect multiple of the local work group size, and/or
2. dynamic shapes are used to reshape the logical tensor data.

## Aside on sizes and extents

This grew out of my struggles with the `max_pool2d` implementation, which led me to look more closely at how `gpu_sizes_ubo()` and `extents_ubo()` differ. I'm writing this summary to explain it to future me when I inevitably forget. This knowledge can be obtained by studying [`Tensor.*`](https://github.com/pytorch/pytorch/blob/cceabe873f11c6611f627a3bb0055994952ec6b8/aten/src/ATen/native/vulkan/api/Tensor.cpp).

If our tensor has

- `cpu_sizes`: (N, C, H, W)

then

- `gpu_sizes`: (W, H, C, N), but the packed-dim size is aligned up to a multiple of 4.
- `extents`: the size of the actual image texture, i.e., `gpu_sizes` but merging C and N and dividing the packed-dim size by 4.

So we obtain:

1. WIDTH_PACKED => `extents`: (W / 4, H, C*N)
2. HEIGHT_PACKED => `extents`: (W, H / 4, C*N)
3. CHANNELS_PACKED => `extents`: (W, H, C*N / 4)

Hence,

- for texture positions, use `extents`,
- for logical coordinates, use `gpu_sizes`.

Differential Revision: [D55097275](https://our.internmc.facebook.com/intern/diff/D55097275/)
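To make the mapping concrete, here is a small C++ sketch of the relationship described above, assuming a 4-D NCHW tensor. The type and function names are illustrative only; they are not the actual API from `Tensor.cpp`.

```cpp
#include <array>
#include <cstdint>

// Illustrative sketch of cpu_sizes -> gpu_sizes -> extents.
enum class PackedDim { WIDTH, HEIGHT, CHANNELS };

constexpr int64_t align_up_4(int64_t v) { return (v + 3) / 4 * 4; }

struct TextureLayout {
  std::array<int64_t, 4> gpu_sizes; // (W, H, C, N), packed dim padded to 4
  std::array<int64_t, 3> extents;   // size of the backing image texture
};

TextureLayout derive_layout(const std::array<int64_t, 4>& cpu_sizes, PackedDim packed) {
  const int64_t N = cpu_sizes[0], C = cpu_sizes[1], H = cpu_sizes[2], W = cpu_sizes[3];

  // gpu_sizes reverses NCHW to WHCN and aligns the packed dim up to a multiple of 4.
  TextureLayout out{{W, H, C, N}, {}};
  switch (packed) {
    case PackedDim::WIDTH:    out.gpu_sizes[0] = align_up_4(W); break;
    case PackedDim::HEIGHT:   out.gpu_sizes[1] = align_up_4(H); break;
    case PackedDim::CHANNELS: out.gpu_sizes[2] = align_up_4(C); break;
  }

  // extents merge C and N into the depth dimension and divide the packed dim
  // by 4, since each texel holds four values along that dimension.
  const auto& g = out.gpu_sizes;
  switch (packed) {
    case PackedDim::WIDTH:    out.extents = {g[0] / 4, g[1], g[2] * g[3]}; break;
    case PackedDim::HEIGHT:   out.extents = {g[0], g[1] / 4, g[2] * g[3]}; break;
    case PackedDim::CHANNELS: out.extents = {g[0], g[1], g[2] * g[3] / 4}; break;
  }
  return out;
}
```

For instance, `derive_layout({1, 1, 2, 3}, PackedDim::WIDTH)` yields gpu_sizes (4, 2, 1, 1) and extents (1, 2, 1), matching the `torch.ones(2, 3)` walkthrough earlier on this page.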