[ET-VK][EZ] Return if unary_op is out of bounds #2520


Closed
wants to merge 2 commits

Conversation

@jorgep31415 (Contributor) commented Mar 19, 2024

Stack from ghstack (oldest at bottom):

Eliminate additional computation when we go out of bounds, which (as sketched below) can occur when

  1. global work group size is not a perfect multiple of local work group size, and/or
  2. dynamic shapes are used to reshape the logical tensor data.
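
For case 1, here is a minimal Python sketch (illustrative names only; the real guard lives in the GLSL shader, not in this snippet) of why some invocations land out of bounds and must return early:

```python
import math

def dispatch_counts(global_size, local_size):
    """Workgroups launched per axis: ceil(global / local)."""
    return tuple(math.ceil(g / l) for g, l in zip(global_size, local_size))

def in_bounds(pos, extents):
    """The early return, modeled on the CPU: an invocation whose texel
    position lies outside the texture extents does no work."""
    return all(p < e for p, e in zip(pos, extents))

# Example: extents (W, H, D) = (3, 5, 2) dispatched with local size (4, 4, 4).
extents, local_size = (3, 5, 2), (4, 4, 4)
groups = dispatch_counts(extents, local_size)                 # (1, 2, 1)
launched = tuple(g * l for g, l in zip(groups, local_size))   # (4, 8, 4)
# 4 * 8 * 4 = 128 invocations run, but only 3 * 5 * 2 = 30 texels exist;
# the other 98 invocations hit the bounds check and return immediately.
print(in_bounds((0, 0, 0), extents), in_bounds((3, 0, 0), extents))  # True False
```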

Aside on sizes and extents

This grew out of my struggles with the `max_pool2d` implementation, which led me to look more closely at how `gpu_sizes_ubo()` and `extents_ubo()` differ. I'm writing this summary to explain it to future me when I inevitably forget. This knowledge can be obtained by studying [`Tensor.*`](https://github.com/pytorch/pytorch/blob/cceabe873f11c6611f627a3bb0055994952ec6b8/aten/src/ATen/native/vulkan/api/Tensor.cpp).

If our tensor has

  • `cpu_sizes`: (N, C, H, W)

then

  • `gpu_sizes`: (W, H, C, N), but the packed-dim size is aligned up to a multiple of 4.
  • `extents`: size of the actual image texture, i.e., `gpu_sizes` but merging C and N and dividing the packed-dim size by 4.

So we obtain:

  1. WIDTH_PACKED => extents: (W / 4, H, C*N)
  2. HEIGHT_PACKED => extents: (W, H / 4, C*N)
  3. CHANNELS_PACKED => extents: (W, H, C*N / 4)

Hence,

  • for texture positions, use `extents`,
  • for logical coordinates, use `gpu_sizes` (see the sketch below).
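
A small Python sketch of these rules (an illustration of the summary above with made-up helper names, not the actual `Tensor.cpp` logic):

```python
def align_up(x, m=4):
    return (x + m - 1) // m * m

def gpu_sizes_and_extents(cpu_sizes, packed_dim):
    """cpu_sizes = (N, C, H, W); packed_dim in {"W", "H", "C"}.
    Mirrors the rules described in this summary."""
    N, C, H, W = cpu_sizes
    gpu = {"W": W, "H": H, "C": C, "N": N}
    gpu[packed_dim] = align_up(gpu[packed_dim])   # only the packed dim is padded
    W_, H_, C_, N_ = gpu["W"], gpu["H"], gpu["C"], gpu["N"]
    if packed_dim == "W":                         # WIDTH_PACKED
        extents = (W_ // 4, H_, C_ * N_)
    elif packed_dim == "H":                       # HEIGHT_PACKED
        extents = (W_, H_ // 4, C_ * N_)
    else:                                         # CHANNELS_PACKED
        extents = (W_, H_, C_ * N_ // 4)          # C_ is already a multiple of 4
    return (W_, H_, C_, N_), extents

# (N, C, H, W) = (1, 3, 4, 6), CHANNELS_PACKED:
print(gpu_sizes_and_extents((1, 3, 4, 6), "C"))   # ((6, 4, 4, 1), (6, 4, 1))
```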

Differential Revision: D55097275


pytorch-bot bot commented Mar 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/2520

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5af2e05 with merge base f5f50b5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Mar 19, 2024
jorgep31415 added a commit that referenced this pull request Mar 19, 2024
ghstack-source-id: 219284095
Pull Request resolved: #2520
@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D55097275


jorgep31415 added a commit that referenced this pull request Mar 19, 2024
Pull Request resolved: #2520

ghstack-source-id: 219292658
@facebook-github-bot (Contributor) commented:

This pull request has been merged in 0f0c307.

jorgep31415 added a commit to jorgep31415/executorch that referenced this pull request Mar 25, 2024
facebook-github-bot pushed a commit that referenced this pull request Mar 27, 2024
Summary:
Pull Request resolved: #2653

As copyrightly pointed out, broadcasting was not working properly for the example below. I root-caused it to confusion between `sizes()` and `gpu_sizes()` once again! These concepts are explained in #2520.

We should use the CPU size, not the GPU size, to detect when we should broadcast across the packed-dim texel's elements.
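
As a hedged illustration (plain Python with a made-up helper name, not the ExecuTorch API), the check has to look at the CPU sizes because aligning both widths up to a multiple of 4 erases the size-1 dimension that signals the broadcast:

```python
def align_up(x, m=4):
    return (x + m - 1) // m * m

def needs_packed_dim_broadcast(self_cpu_sizes, other_cpu_sizes, packed_dim=-1):
    """Broadcast across a texel's elements iff the *CPU* size of the other
    operand along the packed dim is 1 while self's is larger."""
    return other_cpu_sizes[packed_dim] == 1 and self_cpu_sizes[packed_dim] > 1

# torch.ones(2, 3) + torch.ones(2, 1) under WIDTH_PACKED (packed dim = W, the last dim):
self_cpu, other_cpu = (2, 3), (2, 1)
print(needs_packed_dim_broadcast(self_cpu, other_cpu))    # True

# Testing the GPU (aligned-up) widths instead would miss the broadcast entirely:
print(align_up(self_cpu[-1]), align_up(other_cpu[-1]))    # 4 4
```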

# Example

Given inputs `torch.ones(2, 3)` and `torch.ones(2, 1)` and `GPUMemoryLayout::WIDTH_PACKED`, we have CPU widths 3 and 1, respectively. These are aligned up to GPU widths 4 and 4, and hence we were failing to broadcast along the packed-dim texel's elements.

## torch.ones(2, 3)
```
(2, 3) = (H, W) = sizes
[[1 1 1]
 [1 1 1]]
-> (W, H) = (3, 2) → (4, 2) = gpu_sizes
-> extents = (1, 2)
[1 1 1 0] [1 1 1 0]
```

## torch.ones(2, 1)
```
(2, 1)  = (H, W) = sizes
[[1]
 [1]]
-> (W, H) = (1, 2) → (4, 2) = gpu_sizes
-> extents = (1, 2)
[1 0 0 0] [1 0 0 0]
-> (broadcast from this change)
[1 1 1 1] [1 1 1 1]
```

## torch.ones(2, 3) + torch.ones(2, 1)
Ignore the final element of each texel; it's just padding we never read. (The sketch after this block reproduces the arithmetic.)
```
No broadcast:
[1 1 1 0] [1 1 1 0] + [1 0 0 0] [1 0 0 0] = [2 1 1 0] [2 1 1 0]

Broadcast:
[1 1 1 0] [1 1 1 0] + [1 1 1 1] [1 1 1 1] = [2 2 2 1] [2 2 2 1]
```
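
The texel arithmetic above can be reproduced on the CPU with a short Python sketch (plain lists stand in for vec4 texels; helper names are illustrative):

```python
def pack_row(row, texel=4, pad=0):
    """Pad a logical row out to the aligned GPU width and split into texels."""
    padded = row + [pad] * (-len(row) % texel)
    return [padded[i:i + texel] for i in range(0, len(padded), texel)]

def splat(value, texel=4):
    """Broadcast a single element across every lane of the texel."""
    return [value] * texel

row_a = pack_row([1, 1, 1])[0]          # [1, 1, 1, 0]
row_b_no_bcast = pack_row([1])[0]       # [1, 0, 0, 0]
row_b_bcast = splat(1)                  # [1, 1, 1, 1]

print([a + b for a, b in zip(row_a, row_b_no_bcast)])  # [2, 1, 1, 0]  (wrong)
print([a + b for a, b in zip(row_a, row_b_bcast)])     # [2, 2, 2, 1]  (last lane is padding)
```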

# Cleanup

Remove unneeded `check_broadcastable()` since this is caught earlier in the PyTorch compiler pipeline. For example, `torch.ones(2, 3) + torch.ones(2, 2)` triggers this error:
```
TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., size=(2, 3)), FakeTensor(..., size=(2, 2))), **{}):
Attempting to broadcast a dimension of length 2 at -1! Mismatching argument at index 1 had torch.Size([2, 2]); but expected shape should be broadcastable to [2, 3]
```

bypass-github-export-checks

Reviewed By: SS-JIA

Differential Revision: D55278527

fbshipit-source-id: abb8a83924370b21dbbabdd5f1f4af8f502edc1f