RegCount NITER calculation #4159
This pull request was exported from Phabricator. Differential Revision: D59405012
Summary:
Pull Request resolved: pytorch#4159

This adds an internal implementation of https://github.com/microsoft/ArchProbe.

This stack introduces a kernel that can be used to get the number of available registers on a mobile GPU by gradually increasing the number of accessed elements and detecting dramatic drops in performance. See [this paper](https://www.microsoft.com/en-us/research/uploads/prod/2022/02/mobigpu_mobicom22_camera.pdf), page 4, for more information.

This first diff gets the number of iterations (NITER) that can run in 1000us, to be used in the following tests.

The kernel looks like the following for any K number of registers:

```
float reg_data0 = float(niter) + 0;
float reg_data1 = float(niter) + 1;
...
float reg_dataK = float(niter) + K;

int i = 0;
for (; i < niter; ++i) {
  reg_data0 *= reg_dataK;
  reg_data1 *= reg_data0;
  reg_data2 *= reg_data1;
  ...
  reg_dataK *= reg_data(K-1);
}
i = i >> 31;

buffer_out.data[0 * i] = reg_data0;
buffer_out.data[1 * i] = reg_data1;
...
buffer_out.data[K * i] = reg_dataK;
```

Differential Revision: D59405012
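For readers following along, here is a rough Python sketch of the two ideas in the summary above: expanding the `...` in the kernel for a concrete K, and picking an NITER that runs for about 1000us. It is not the code in this diff. The names `regcount_kernel_body`, `calibrate_niter`, and `fake_kernel_us` are hypothetical, the doubling-then-rescaling search is an assumed strategy, and `fake_kernel_us` is a made-up cost model standing in for a real GPU timing query. On the `i = i >> 31` line: assuming a 32-bit signed `int` that ends the loop non-negative, the shift yields 0, so every store lands on `buffer_out.data[0]`; the stores presumably exist only to keep all K+1 registers live so the compiler cannot drop the multiplies.

```python
from typing import Callable


def regcount_kernel_body(k: int) -> str:
    """Expand the schematic kernel for registers reg_data0..reg_dataK."""
    n = k + 1  # number of register variables
    lines = [f"float reg_data{j} = float(niter) + {j};" for j in range(n)]
    lines += ["int i = 0;", "for (; i < niter; ++i) {"]
    # Each register is multiplied by its predecessor; reg_data0 wraps to reg_dataK.
    lines += [f"  reg_data{j} *= reg_data{(j - 1) % n};" for j in range(n)]
    lines += ["}", "i = i >> 31;"]
    # One store per register so that none of them is dead code.
    lines += [f"buffer_out.data[{j} * i] = reg_data{j};" for j in range(n)]
    return "\n".join(lines)


def calibrate_niter(
    time_kernel_us: Callable[[int], float],
    target_us: float = 1000.0,
    max_niter: int = 1 << 30,
) -> int:
    """Return an iteration count whose measured runtime is roughly target_us.

    Assumed strategy (not taken from the diff): double niter until one launch
    takes at least target_us, then rescale assuming runtime is linear in niter.
    """
    niter = 1
    elapsed = time_kernel_us(niter)
    while elapsed < target_us and niter < max_niter:
        niter *= 2
        elapsed = time_kernel_us(niter)
    return max(1, int(niter * target_us / elapsed))


def fake_kernel_us(niter: int) -> float:
    """Hypothetical timing model: 5us launch overhead plus 0.01us per iteration."""
    return 5.0 + 0.01 * niter


if __name__ == "__main__":
    print(regcount_kernel_body(2))
    print("calibrated NITER:", calibrate_niter(fake_kernel_us))
```

With a real backend, `fake_kernel_us` would be replaced by a function that dispatches the shader with the given `niter` and returns the measured GPU time in microseconds. The calibrated NITER would then be held fixed while K is swept, so that a sudden rise in runtime can be attributed to register spilling rather than to a change in loop length.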
Reviewed By: SS-JIA
This pull request has been merged in ac1c7d0.