[ET-VK] Integrate axis mapping into staging <-> image transfer shaders #5093

Merged
merged 18 commits into from
Sep 6, 2024

Conversation

SS-JIA
Contributor

@SS-JIA SS-JIA commented Sep 4, 2024

Stack from ghstack (oldest at bottom):

Context

Building on the previous diff, this diff integrates axis mapping into staging <-> image transfer shaders. Alternative versions of indexing utility functions are introduced to account for axis mapping.

The impact of axis mapping on the latency of the transfer shaders is examined in the next diff.

Differential Revision: D62210117

## Context

Add a simple test to track the sizes of various important objects in the Vulkan compute graph API over time. The test uses some loose thresholds to alert when an object has grown unexpectedly large.

Differential Revision: [D62144400](https://our.internmc.facebook.com/intern/diff/D62144400/)

[ghstack-poisoned]
## Context

Introduce the `SymInt` class which allows representation of symbolic integers in a Vulkan graph.

Please see the documentation comments on the `SymInt` class for more details on why the `Int` type is not sufficient for representing symbolic integers.

Differential Revision: [D62144399](https://our.internmc.facebook.com/intern/diff/D62144399/)

[ghstack-poisoned]
## Context

Normally, tensor memory is planned during the export stage; tensors whose lifetimes do not overlap may share a memory allocation. Memory planning therefore requires knowledge of each tensor's lifetime.

However, some complex operators cannot perform all the necessary computations in one shader, or their implementation requires that temporary tensors be created during the execution of the op. Since these temporary tensors are not visible to the memory planning algorithm, they will not be memory planned.

This diff introduces the `TmpTensorVRef` object, which facilitates memory sharing between temporary tensors. The design principle is that the lifetime of a temporary tensor is restricted to the execution of the op within which it is created; that knowledge can be used to implement memory planning. Please see the documentation comments on `TmpTensorVRef` for more details.
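
The scoped-lifetime idea can be illustrated with a small Python sketch. This is not the `TmpTensorVRef` implementation; `TmpPool`, `ScopedTmp`, and `run_op` are hypothetical names, and real tensors would also have to match on size and dtype before a slot could be reused, which the sketch ignores.

```python
class TmpPool:
    """Hypothetical pool that hands out and recycles temporary tensor slots."""

    def __init__(self):
        self.num_allocated = 0  # number of distinct slots ever created
        self._free = []         # slot ids currently available for reuse

    def acquire(self):
        # Reuse a released slot when possible; otherwise create a new one.
        if self._free:
            return self._free.pop()
        slot = self.num_allocated
        self.num_allocated += 1
        return slot

    def release(self, slot):
        self._free.append(slot)


class ScopedTmp:
    """Scoped handle: the slot goes back to the pool when the scope ends."""

    def __init__(self, pool):
        self._pool = pool
        self.slot = None

    def __enter__(self):
        self.slot = self._pool.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._pool.release(self.slot)
        return False


def run_op(pool):
    # Two temporaries that are alive at the same time need distinct slots.
    with ScopedTmp(pool) as a, ScopedTmp(pool) as b:
        return (a.slot, b.slot)


pool = TmpPool()
first = run_op(pool)   # creates two slots
second = run_op(pool)  # reuses the slots released by the first op
```

Because each temporary is released when its op finishes, the second op runs entirely on recycled slots and the pool never grows beyond two allocations.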

Differential Revision: [D62144398](https://our.internmc.facebook.com/intern/diff/D62144398/)

[ghstack-poisoned]
## Context

Currently, shaders must explicitly declare the binding slot that each layout binding will bind to, e.g.

```
${layout_declare_tensor(0, "w", "t_out", DTYPE, STORAGE)}
${layout_declare_buffer(1, "r", "nchw_in", DTYPE)}
${layout_declare_ubo(2, "ivec4", "sizes")}
```

However, this can get a little tedious when making many layout declarations. This diff improves the situation by adding the `B` variable, which automatically increments the binding slot whenever a layout binding is declared. Now we can write

```
${layout_declare_tensor(B, "w", "t_out", DTYPE, STORAGE)}
${layout_declare_buffer(B, "r", "nchw_in", DTYPE)}
${layout_declare_ubo(B, "ivec4", "sizes")}
```

I may make a follow up diff to change all layout declarations to use `B` across all shaders in the codebase later on.
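
For illustration, an auto-incrementing slot could be modeled roughly as follows. This is a hedged sketch, not the codegen's actual implementation: `BindingCounter`, `resolve_slot`, and `declare_ubo` are hypothetical names, the second UBO name is made up, and the emitted GLSL is simplified.

```python
class BindingCounter:
    """Hypothetical counter that yields 0, 1, 2, ... on successive calls."""

    def __init__(self):
        self._next = 0

    def take(self):
        slot = self._next
        self._next += 1
        return slot


def resolve_slot(slot):
    # An int is an explicit slot; a BindingCounter assigns the next free one.
    if isinstance(slot, BindingCounter):
        return slot.take()
    return slot


def declare_ubo(slot, type_name, var_name):
    # Simplified GLSL for illustration; a real macro may emit more qualifiers.
    binding = resolve_slot(slot)
    return (
        f"layout(set = 0, binding = {binding}) uniform {var_name}_ubo "
        f"{{ {type_name} {var_name}; }};"
    )


B = BindingCounter()
lines = [
    declare_ubo(B, "ivec4", "sizes"),         # takes binding 0
    declare_ubo(B, "ivec4", "axis_mapping"),  # takes binding 1
    declare_ubo(5, "int", "packed_dim"),      # explicit slots still work
]
```

Accepting either an int or a counter keeps existing shaders with explicit slots valid while new ones adopt `B`.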

Differential Revision: [D62210119](https://our.internmc.facebook.com/intern/diff/D62210119/)

[ghstack-poisoned]
…tensors

## Context

This diff introduces the `axis_mapping` field for `vTensors`, which can be used to implement no-copy permutes. Axis mapping is somewhat analogous to dim order, but for texture-backed tensors.

The axis mapping is normalized to 4 dimensions, similar to padded sizes. The first 3 elements indicate which of the (X, Y, Z) image texture axes the width, height, and channels dims of the tensor map to. The final element indicates the WHCN index of the tensor dimension along which batches will be concatenated.
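
A minimal Python sketch of how such a mapping could translate a tensor index into a texture position, and how a permute could be expressed by rewriting only the metadata. The function names are hypothetical, the mapping `[0, 1, 2, 2]` is used here as an assumed conventional layout, and the sketch assumes one tensor element per texel (it ignores the packing of several elements into one texel).

```python
def to_texture_pos(idx_whcn, sizes_whcn, axis_map):
    """Map a (w, h, c, n) tensor index to an (x, y, z) texture position."""
    pos = [0, 0, 0]
    # The first three entries pick the texture axis for W, H and C.
    for dim in range(3):
        pos[axis_map[dim]] = idx_whcn[dim]
    # The last entry names the WHCN dim along which batches are stacked.
    concat_dim = axis_map[3]
    pos[axis_map[concat_dim]] += idx_whcn[3] * sizes_whcn[concat_dim]
    return tuple(pos)


def permute_whc(sizes_whcn, axis_map, perm):
    """Permute W/H/C without moving data: new dim d takes old dim perm[d]."""
    new_sizes = [sizes_whcn[p] for p in perm] + [sizes_whcn[3]]
    new_map = [axis_map[p] for p in perm]
    # The batch concat dim must follow the dim it was attached to.
    new_map.append(perm.index(axis_map[3]))
    return new_sizes, new_map


sizes = [4, 3, 2, 2]        # W, H, C, N
default_map = [0, 1, 2, 2]  # W->X, H->Y, C->Z, batches stacked along C

old_idx = (3, 2, 1, 1)
old_pos = to_texture_pos(old_idx, sizes, default_map)

# Swap width and channels by editing only sizes and the axis mapping.
perm = [2, 1, 0]
new_sizes, new_map = permute_whc(sizes, default_map, perm)
new_idx = (old_idx[2], old_idx[1], old_idx[0], old_idx[3])
new_pos = to_texture_pos(new_idx, new_sizes, new_map)
```

Both lookups land on the same texel, which is what allows the permuted tensor to share the original texture.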

The benefit of introducing axis mapping is twofold:

1. Permutes can be performed without any data copying by re-using a texture and updating only the axis mapping.
2. The memory layout of texture-backed tensors becomes more flexible and can be optimized for performance or memory footprint by using unconventional axis mappings.

Regarding the second point, we have found that adding length to a texture's Z axis is more costly than adding length to the texture's X or Y axes. Similarly, we have found that reading along the Z axis yields slightly lower throughput than reading along the X or Y axes. By introducing axis mapping, we can map the largest dimension to a texture's X axis instead of mapping it to the most intuitive texture axis (e.g. channels to the Z axis). This can save a lot of texture memory and potentially improve compute shader latency as well.
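
As a rough illustration, the sketch below computes the texture extents implied by two different axis mappings for the same tensor. `texture_extents` is a hypothetical helper that assumes one element per texel; it only shows where the large dimension ends up and does not model the hardware costs described above.

```python
def texture_extents(sizes_whcn, axis_map):
    """Return the (X, Y, Z) extents a tensor occupies under an axis mapping."""
    extents = [1, 1, 1]
    # Place the W, H and C sizes on their assigned texture axes.
    for dim in range(3):
        extents[axis_map[dim]] = sizes_whcn[dim]
    # Batches are concatenated along the texture axis of the concat dim.
    concat_dim = axis_map[3]
    extents[axis_map[concat_dim]] *= sizes_whcn[3]
    return tuple(extents)


sizes = [8, 8, 512, 2]  # W, H, C, N: channels is the largest dim

conventional = texture_extents(sizes, [0, 1, 2, 2])   # channels on Z
channels_on_x = texture_extents(sizes, [2, 1, 0, 2])  # channels on X
```

In this simplified model the texel count is identical in both cases; the difference is that the long axis moves from Z to X.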

However, the prerequisite for using axis mapping heavily is that the overhead introduced in calculating tensor indices and texture positions does not significantly increase compute shader latency. The impact of this will be investigated and shown in the following diffs.

Note that this diff only introduces the `axis_mapping` field.

Differential Revision: [D62210118](https://our.internmc.facebook.com/intern/diff/D62210118/)

[ghstack-poisoned]
## Context

Building on the previous diff, this diff integrates axis mapping into staging <-> buffer transfer shaders. Alternative versions of indexing utility functions are introduced to account for axis mapping.

The impact of axis mapping on the latency of the transfer shaders is examined in the next diff.

Differential Revision: [D62210117](https://our.internmc.facebook.com/intern/diff/D62210117/)

[ghstack-poisoned]

pytorch-bot bot commented Sep 4, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5093

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit a2ae8dd with merge base 9739609 (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 4, 2024
SS-JIA added a commit that referenced this pull request Sep 4, 2024
ghstack-source-id: 241066644
Pull Request resolved: #5093
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D62210117

@SS-JIA SS-JIA changed the base branch from gh/SS-JIA/70/base to gh/SS-JIA/69/head September 4, 2024 22:05
SS-JIA added a commit that referenced this pull request Sep 5, 2024
Pull Request resolved: #5093

ghstack-source-id: 241249802
SS-JIA added a commit that referenced this pull request Sep 6, 2024
Pull Request resolved: #5093

ghstack-source-id: 241282078

@SS-JIA SS-JIA changed the title from "[ET-VK] Integrate axis mapping into staging <-> buffer transfer shaders" to "[ET-VK] Integrate axis mapping into staging <-> image transfer shaders" Sep 6, 2024
Base automatically changed from gh/SS-JIA/69/head to main September 6, 2024 03:28
SS-JIA added a commit that referenced this pull request Sep 6, 2024
Pull Request resolved: #5093

ghstack-source-id: 241354024

@facebook-github-bot facebook-github-bot merged commit 41ec7fa into main Sep 6, 2024
36 of 38 checks passed
@facebook-github-bot facebook-github-bot deleted the gh/SS-JIA/70/head branch September 6, 2024 16:22
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported
3 participants