RegCount NITER calculation #4159

Closed

Esteb37 wants to merge 1 commit into pytorch:main from Esteb37:export-D59405012

Contributor

Esteb37 commented Jul 5, 2024 •

This adds an internal implementation of https://github.com/microsoft/ArchProbe.

This stack introduces a kernel that estimates the number of available registers on a mobile GPU by gradually increasing the number of accessed elements and detecting dramatic drops in performance. See this paper (https://www.microsoft.com/en-us/research/uploads/prod/2022/02/mobigpu_mobicom22_camera.pdf), page 4, for more information.
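The detection step can be sketched roughly as follows. This is only an illustration and not the code in this stack: the function name `find_performance_cliff`, the `jump_ratio` threshold, and the sample timings are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def find_performance_cliff(
    samples: Sequence[Tuple[int, float]],
    jump_ratio: float = 1.5,
) -> Optional[int]:
    """Return the last register count measured before the first large slowdown.

    `samples` holds (register_count, measured_cost) pairs in increasing
    register_count order. Both the cost unit and the threshold are assumptions.
    """
    for (prev_n, prev_cost), (_, cur_cost) in zip(samples, samples[1:]):
        # A cost jump of at least jump_ratio between neighbours is treated as the cliff.
        if prev_cost > 0 and cur_cost / prev_cost >= jump_ratio:
            return prev_n
    return None  # no cliff found in the measured range


# Synthetic measurements, invented for illustration only.
synthetic = [(16, 1.00), (32, 1.05), (48, 1.10), (64, 2.60), (80, 2.70)]
cliff = find_performance_cliff(synthetic)
```

With these made-up numbers the cost more than doubles between 48 and 64 accessed elements, so the sketch reports 48 as the last count before the drop.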

This first diff computes the number of loop iterations (NITER) that can run in 1000 µs; the tests that follow in the stack use this value.
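One way such a calibration could work is sketched below. It is an assumption about the general approach, not the implementation in this diff: `calibrate_niter`, the `time_kernel_us` callback (which would run the kernel with a given iteration count and return the elapsed microseconds), the growth factor of 10, and the `max_niter` cap are all hypothetical.

```python
from typing import Callable


def calibrate_niter(
    time_kernel_us: Callable[[int], float],
    target_us: float = 1000.0,
    max_niter: int = 1 << 30,
) -> int:
    """Estimate how many loop iterations run in roughly target_us microseconds."""
    niter = 1
    elapsed = time_kernel_us(niter)
    # Grow geometrically until one run is long enough to time reliably.
    while elapsed < target_us / 10 and niter < max_niter:
        niter = min(niter * 10, max_niter)
        elapsed = time_kernel_us(niter)
    if elapsed <= 0:
        raise ValueError("timer returned a non-positive duration")
    # Linear extrapolation, assuming run time is proportional to niter.
    return max(1, int(niter * target_us / elapsed))


# Stand-in for a real GPU timer: pretend each iteration costs 0.5 us.
niter = calibrate_niter(lambda n: 0.5 * n)
```

With the fake timer above, the sketch settles on 2000 iterations, which matches 1000 µs at 0.5 µs per iteration. A real timer would include launch overhead and noise, so the result would only be approximate.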

The kernel looks like the following for any K number of registers:

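// One float per element under test.
float reg_data0 = float(niter) + 0;
float reg_data1 = float(niter) + 1;
...
float reg_dataK = float(niter) + K;

// Each multiply reads the value updated just before it, forming one dependency chain.
int i = 0;
for (; i < niter; ++i) {
  reg_data0 *= reg_dataK;
  reg_data1 *= reg_data0;
  reg_data2 *= reg_data1;
  ...
  reg_dataK *= reg_data(K-1);
}

// i is a non-negative 32-bit int here, so i >> 31 is 0 and every store below writes to data[0].
i = i >> 31;

// Every value is written out, presumably so that none of the multiplies above is dead code.
buffer_out.data[0 * i] = reg_data0;
buffer_out.data[1 * i] = reg_data1;
...
buffer_out.data[K * i] = reg_dataK;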
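Because the kernel body follows a regular pattern, it can be generated for any element count. The sketch below is a hypothetical illustration of that pattern only: `regcount_kernel_body` is an invented name, it emits just the body shown above (no shader boilerplate), and it numbers elements from 0 to `nreg - 1`, so `nreg` corresponds to K + 1 in the listing.

```python
def regcount_kernel_body(nreg: int) -> str:
    """Emit the kernel body from the listing for nreg elements (indices 0..nreg-1)."""
    if nreg < 2:
        raise ValueError("need at least two elements to form the dependency chain")
    # Declarations: one float per element.
    lines = [f"float reg_data{j} = float(niter) + {j};" for j in range(nreg)]
    lines += ["", "int i = 0;", "for (; i < niter; ++i) {"]
    # Each element is multiplied by its predecessor; element 0 wraps around to the last one.
    lines += [f"  reg_data{j} *= reg_data{(j - 1) % nreg};" for j in range(nreg)]
    lines += ["}", "", "i = i >> 31;", ""]
    # One store per element, all indexed through i as in the listing.
    lines += [f"buffer_out.data[{j} * i] = reg_data{j};" for j in range(nreg)]
    return "\n".join(lines)


body = regcount_kernel_body(3)
```

Generating the text this way keeps the declarations, loop body, and stores in sync as the element count grows between test runs.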

Differential Revision: D59405012

pytorch-bot bot commented Jul 5, 2024 •

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4159

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

[PREEMPTIVE] CPU amd64 non-GPU instances migration to Linux AMZN2023

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label

Contributor

facebook-github-bot commented Jul 5, 2024

This pull request was exported from Phabricator. Differential Revision: D59405012

facebook-github-bot added the fb-exported label

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          RegCount NITER calculation (pytorch#4159)

3d3babf

Summary: Pull Request resolved: pytorch#4159

Differential Revision: D59405012

Contributor

facebook-github-bot commented Jul 8, 2024

This pull request was exported from Phabricator. Differential Revision: D59405012

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          RegCount NITER calculation (pytorch#4159)

0e83e07

Summary: Pull Request resolved: pytorch#4159

Differential Revision: D59405012

Esteb37 force-pushed the export-D59405012 branch from d1dab92 to 0e83e07

July 8, 2024 22:07

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          RegCount NITER calculation (pytorch#4159)

68bbba5

Summary: Pull Request resolved: pytorch#4159

Differential Revision: D59405012

Contributor

facebook-github-bot commented Jul 9, 2024

This pull request was exported from Phabricator. Differential Revision: D59405012

Esteb37 force-pushed the export-D59405012 branch from 0e83e07 to fdaace9

July 9, 2024 14:41

SS-JIA approved these changes

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          RegCount NITER calculation (pytorch#4159)

b7f3b20

Summary:
Pull Request resolved: pytorch#4159

This adds an internal implementation of https://github.com/microsoft/ArchProbe.

This stack introduces a kernel that can be used to get the number of available registers on a mobile GPU by gradually increasing the number of accessed elements and detecting dramatic drops in performance. See [this paper](https://www.microsoft.com/en-us/research/uploads/prod/2022/02/mobigpu_mobicom22_camera.pdf), page 4, for more information.

This first diff gets the number of iterations (NITER) that can run in 1000us, to be used in the following tests.

The kernel looks like the following for any K number of registers:

  float reg_data0 = float(niter) + 0;
  float reg_data1 = float(niter) + 1;
  ...
  float reg_dataK = float(niter) + K;

  int i = 0;
  for (; i < niter; ++i) {
    reg_data0 *= reg_dataK;
    reg_data1 *= reg_data0;
    reg_data2 *= reg_data1;
    ...
    reg_dataK *= reg_data(K-1);
  }

  i = i >> 31;

  buffer_out.data[0 * i] = reg_data0;
  buffer_out.data[1 * i] = reg_data1;
  ...
  buffer_out.data[K * i] = reg_dataK;

Differential Revision: D59405012

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          RegCount NITER calculation (pytorch#4159)

8cb2508

Summary:
Pull Request resolved: pytorch#4159

Differential Revision: D59405012

Contributor

facebook-github-bot commented Jul 11, 2024

This pull request was exported from Phabricator. Differential Revision: D59405012

Esteb37 force-pushed the export-D59405012 branch from fdaace9 to da22ab2

July 11, 2024 14:36

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          RegCount NITER calculation (pytorch#4159)

da22ab2

Summary:
Pull Request resolved: pytorch#4159

Reviewed By: SS-JIA

Differential Revision: D59405012

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          RegCount NITER calculation (pytorch#4159)

053b0a7

Summary:
Pull Request resolved: pytorch#4159

Differential Revision: D59405012

Contributor

facebook-github-bot commented Jul 11, 2024

This pull request was exported from Phabricator. Differential Revision: D59405012

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          RegCount NITER calculation (pytorch#4159)

d3998ea

Summary:
Pull Request resolved: pytorch#4159

Reviewed By: SS-JIA

Differential Revision: D59405012

Esteb37 force-pushed the export-D59405012 branch from da22ab2 to d3998ea

July 11, 2024 15:23

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          RegCount NITER calculation (pytorch#4159)

df26343

Summary:
Pull Request resolved: pytorch#4159

Differential Revision: D59405012

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          RegCount NITER calculation (pytorch#4159)

b644855

Summary:
Pull Request resolved: pytorch#4159

Differential Revision: D59405012


          RegCount NITER calculation (pytorch#4159)

23949a0

Summary:
Pull Request resolved: pytorch#4159

Reviewed By: SS-JIA

Differential Revision: D59405012

Contributor

facebook-github-bot commented Jul 11, 2024

This pull request was exported from Phabricator. Differential Revision: D59405012

Esteb37 force-pushed the export-D59405012 branch from d3998ea to 23949a0

July 11, 2024 16:42

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          RegCount NITER calculation (pytorch#4159)

902c963

Summary:
Pull Request resolved: pytorch#4159

Differential Revision: D59405012

Reviewed By: SS-JIA

facebook-github-bot closed this in

ac1c7d0

facebook-github-bot added the Merged label

Contributor

facebook-github-bot commented Jul 11, 2024

This pull request has been merged in ac1c7d0.

Labels

CLA Signed fb-exported Merged