RegCount max registers calculation #4171
This pull request was exported from Phabricator. Differential Revision: D59494644
Summary:
Pull Request resolved: pytorch#4171

This project adds an internal implementation of https://github.com/microsoft/ArchProbe.

This stack introduces a kernel that estimates the number of registers available on a mobile GPU by gradually increasing the number of accessed elements and detecting sharp drops in performance. See [this paper](https://www.microsoft.com/en-us/research/uploads/prod/2022/02/mobigpu_mobicom22_camera.pdf), page 4, for more information.

This diff finds the number of registers available to a single thread by increasing the number of registers the kernel uses and looking for jumps in latency. For a Galaxy S22, the latency graph ({F1750619092}) makes it easy to spot the point at which registers spill into memory.

Differential Revision: D59494644
Reviewed By: SS-JIA
Summary:
Pull Request resolved: pytorch#4159

This adds an internal implementation of https://github.com/microsoft/ArchProbe.

This stack introduces a kernel that can be used to get the number of available registers on a mobile GPU by gradually increasing the number of accessed elements and detecting dramatic drops in performance. See [this paper](https://www.microsoft.com/en-us/research/uploads/prod/2022/02/mobigpu_mobicom22_camera.pdf), page 4, for more information.

This first diff gets the number of iterations (NITER) that can run in 1000 µs, to be used in the following tests. The kernel looks like the following for any K number of registers:

    float reg_data0 = float(niter) + 0;
    float reg_data1 = float(niter) + 1;
    ...
    float reg_dataK = float(niter) + K;

    int i = 0;
    for (; i < niter; ++i) {
      reg_data0 *= reg_dataK;
      reg_data1 *= reg_data0;
      reg_data2 *= reg_data1;
      ...
      reg_dataK *= reg_data(K-1);
    }
    i = i >> 31;

    buffer_out.data[0 * i] = reg_data0;
    buffer_out.data[1 * i] = reg_data1;
    ...
    buffer_out.data[K * i] = reg_dataK;

Differential Revision: D59405012
Reviewed By: SS-JIA
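Since the kernel body follows a fixed pattern for any K, it can be produced by a small host-side generator. The following is a minimal sketch of that idea, not the code from this PR: the function name `make_reg_probe_body` is hypothetical, and it emits only the kernel body shown above, without the surrounding shader declarations (buffer bindings, `niter`, and so on).

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper (not from the PR): emits the kernel body for the
// variables reg_data0 .. reg_dataK, following the pattern in the summary.
std::string make_reg_probe_body(int k) {
  assert(k >= 1);  // at least two variables are needed to form the chain
  std::ostringstream src;

  // Declarations: each variable depends on niter, so it cannot be folded
  // into a compile-time constant.
  for (int r = 0; r <= k; ++r) {
    src << "float reg_data" << r << " = float(niter) + " << r << ";\n";
  }

  // Circular dependency chain: each variable is multiplied by the previous
  // one, and reg_data0 is multiplied by reg_dataK.
  src << "int i = 0;\n";
  src << "for (; i < niter; ++i) {\n";
  for (int r = 0; r <= k; ++r) {
    const int prev = (r == 0) ? k : r - 1;
    src << "  reg_data" << r << " *= reg_data" << prev << ";\n";
  }
  src << "}\n";

  // After the loop i equals niter; assuming niter is a non-negative 32-bit
  // int, i >> 31 is 0, so every store below targets element 0.
  src << "i = i >> 31;\n";
  for (int r = 0; r <= k; ++r) {
    src << "buffer_out.data[" << r << " * i] = reg_data" << r << ";\n";
  }
  return src.str();
}
```

Calling `make_reg_probe_body(K)` for increasing K would yield one kernel variant per register count. The final stores presumably exist so that every variable contributes to an observable output and the compiler cannot discard it.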
This pull request has been merged in 09336a6.
This project adds an internal implementation of https://github.com/microsoft/ArchProbe.
This stack introduces a kernel that estimates the number of registers available on a mobile GPU by gradually increasing the number of accessed elements and detecting sharp drops in performance. See this paper (https://www.microsoft.com/en-us/research/uploads/prod/2022/02/mobigpu_mobicom22_camera.pdf), page 4, for more information.
This diff finds the number of registers available to a single thread by increasing the number of registers the kernel uses and looking for jumps in latency. For a Galaxy S22, the latency graph makes it easy to spot the point at which registers spill into memory.
Differential Revision: D59494644
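The spill point described above can also be located programmatically from the measured latencies. The sketch below is one simple way to do so, assuming a list of latencies ordered by increasing register count; the function name `find_latency_jump` and the ratio threshold are illustrative assumptions, not the detection logic used by this PR or by ArchProbe.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): returns the index of the first
// sample whose latency exceeds the previous sample's latency by more than
// a factor of jump_ratio, or -1 if no such jump exists.
int find_latency_jump(const std::vector<double>& latency_us,
                      double jump_ratio) {
  assert(jump_ratio > 1.0);
  for (std::size_t i = 1; i < latency_us.size(); ++i) {
    const double prev = latency_us[i - 1];
    // Skip non-positive samples so the ratio comparison stays meaningful.
    if (prev > 0.0 && latency_us[i] > prev * jump_ratio) {
      return static_cast<int>(i);
    }
  }
  return -1;
}
```

If sample `i` was measured with `counts[i]` registers and the function returns index `j`, then `counts[j - 1]` is the largest register count measured before the jump. A fixed ratio is a crude criterion, and noisy measurements would likely need smoothing or repeated runs.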