[Benchmark] Generate benchmark record for job failure #9247
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9247
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit 69c42fa with merge base dd9a85a.
BROKEN TRUNK - The following job failed but was already present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```
    A job can fail at two levels: GIT_JOB and DEVICE_JOB. If any job fails,
    generate a failure benchmark record.
    """
    artifacts = content.get("artifacts")
    git_job_name = content["git_job_name"]
```
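For orientation, here is a hedged sketch of how the two failure levels could map to a record; the helper names (`generate_failure_record`, `extract_model_info`) and the exact payload are assumptions modeled on the examples in the PR description below, not the actual implementation:

```python
import json

# Sketch only: emit a FAILURE_REPORT record for either failure level.
# GIT_JOB failures carry no job report; DEVICE_JOB failures do.
def generate_failure_record(content: dict) -> dict:
    git_job_name = content["git_job_name"]
    job_report = content.get("job_report")  # None for GIT_JOB-level failures
    failure_type = "DEVICE_JOB" if job_report else "GIT_JOB"
    model_name, backend = extract_model_info(git_job_name)  # hypothetical parser
    return {
        "benchmark": {
            "name": "ExecuTorch",
            "mode": "inference",
            "extra_info": {
                "job_conclusion": "FAILURE",
                "failure_type": failure_type,
                "job_report": json.dumps(job_report or {}),
            },
        },
        "model": {"name": model_name, "type": "OSS model", "backend": backend},
        "metric": {"name": "FAILURE_REPORT", "benchmark_values": 0, "target_value": 0},
    }
```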
A word of caution when trying to extract information about the run from the job name. The job name comes from this line: https://github.com/pytorch/executorch/blob/main/.ci/scripts/gather_benchmark_configs.py#L335. So, I think:
- Make the error raised when failing to parse the job name in `extract_model_info` clearer by referring to the `gather_benchmark_configs` script (see the sketch below). Most likely, it has been updated without updating `extract_benchmark_results`.
- Add a comment on both scripts that they need to be kept in sync.
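A minimal sketch of what such a clearer error could look like; the regex and the job-name format it assumes are illustrative, not the actual implementation:

```python
import re

# Illustrative only: the real job-name format is produced by
# .ci/scripts/gather_benchmark_configs.py and must stay in sync with this parser.
JOB_NAME_RE = re.compile(r"benchmark-on-device \((?P<model>[^,]+), (?P<backend>[^,]+),")

def extract_model_info(git_job_name: str) -> tuple[str, str]:
    m = JOB_NAME_RE.search(git_job_name)
    if not m:
        raise ValueError(
            f"Unable to parse model info from job name {git_job_name!r}. "
            "The job name is generated by .ci/scripts/gather_benchmark_configs.py; "
            "it was likely changed without updating extract_benchmark_results."
        )
    return m.group("model"), m.group("backend")
```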
sounds good!
I also raise an exception for `get_app_type` and `get_device_os_type`.
Added unit tests for those cases too.
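For illustration, a hedged sketch of what raising in those helpers could look like; the accepted suffixes and prefixes are assumptions based on the `app_type` and OS values shown in the examples below:

```python
def get_app_type(app_path: str) -> str:
    # Assumed mapping, based on the IOS_APP value seen in the example records.
    if app_path.endswith(".ipa"):
        return "IOS_APP"
    if app_path.endswith(".apk"):
        return "ANDROID_APP"
    raise ValueError(f"Unknown app type for artifact {app_path!r}")

def get_device_os_type(device_os: str) -> str:
    # Assumed prefixes, matching the "Android"/"iOS" runner types below.
    if device_os.startswith("iOS"):
        return "iOS"
    if device_os.startswith("Android"):
        return "Android"
    raise ValueError(f"Unknown device OS {device_os!r}")
```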
@huydhn
not urgent: maybe we can add a unit test to check that this works as expected. We could create a common lib to share some configs between this script and the other one.
I also added a comment in perf.yml for the job-name step change.
If this script is used outside of those yml files, we can make the regex prefix more flexible to let the user pass the step name for checking.
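A minimal unit-test sketch along those lines, assuming the hypothetical `extract_model_info` and job-name format from the earlier comment:

```python
import unittest

class TestExtractModelInfo(unittest.TestCase):
    def test_parses_model_and_backend(self):
        # Job name format assumed to mirror gather_benchmark_configs.py output.
        name = "benchmark-on-device (ic4, mps, apple_iphone_15, ...)"
        model, backend = extract_model_info(name)
        self.assertEqual((model, backend), ("ic4", "mps"))

    def test_unparseable_name_raises(self):
        with self.assertRaises(ValueError):
            extract_model_info("some-unrelated-job-name")

if __name__ == "__main__":
    unittest.main()
```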
While checking the results from your branch on the dashboard, I noticed a curious issue where older commits from your branch have newer timestamps. It seems like an issue for later.
LGTM! Let's take an action item to watch for the next failure from these workflows after this lands, to double-check the results on the dashboard before resolving the issue.
interesting, created an issue here: pytorch/test-infra#6427
sounds good, I also need to update the UI
# Description
Compose a failure benchmark record.

Issue: pytorch/test-infra#6294

# Related Query File
https://github.com/pytorch/test-infra/blob/main/torchci/clickhouse_queries/oss_ci_benchmark_llms/query.sql

# Details
When a job fails at the git_job level, or a device fails during the benchmark test, we return a benchmark record to indicate the failure, so that the HUD UI can properly distinguish metrics that were not run from metrics that ran with failures.

This PR may temporarily introduce `Unknown` in some fields of the HUD execubench table; this will be fixed in the HUD UI by handling the special benchmark value.

For both levels of failure, the metric name will be "FAILURE_REPORT". In HUD, we mainly use this special metric name to identify failures; if more information is needed, the level at which the job failed is recorded in benchmark.extra_info.

## Step Failure
When a failure is detected, we try to extract model info from git_job_name; the step will fail if the model info cannot be extracted.

# Example of a failure benchmark record
## When a job fails at the device-job level
- device_name: taken from job_report.name, for instance `iPhone 15`
- device_os: job_report.os with prefix "Android" or "iOS"; this should match both Android and iOS settings
- model.name: extracted from the git job name
- model.backend: extracted from the git job name
- metric.name: "FAILURE_REPORT"

```
{
  "benchmark": {
    "name": "ExecuTorch",
    "mode": "inference",
    "extra_info": {
      "app_type": "IOS_APP",
      "job_conclusion": "FAILED",
      "failure_type": "DEVICE_JOB",
      "job_report": "..."
    }
  },
  "model": {
    "name": "ic4",
    "type": "OSS model",
    "backend": "mps"
  },
  "metric": {
    "name": "FAILURE_REPORT",
    "benchmark_values": 0,
    "target_value": 0,
    "extra_info": {
      "method": ""
    }
  },
  "runners": [
    {
      "name": "iPhone 15",
      "type": "iOS 18.0"
    }
  ]
}
```

## When a job fails at the git-job level (there are no job_reports)
This happens when a job fails before it runs the benchmark job.
- device_name: device_pool_name from the git job name, for example `samsung_galaxy_s22`
- device_os: "Android" or "iOS"
- model.name: extracted from the git job name
- model.backend: extracted from the git job name
- metric.name: "FAILURE_REPORT"

The failure benchmark record looks like:

```
{
  "benchmark": {
    "name": "ExecuTorch",
    "mode": "inference",
    "extra_info": {
      "app_type": "IOS_APP",
      "job_conclusion": "FAILURE",
      "failure_type": "GIT_JOB",
      "job_report": "{}"
    }
  },
  "model": {
    "name": "ic4",
    "type": "OSS model",
    "backend": "mps"
  },
  "metric": {
    "name": "FAILURE_REPORT",
    ...
  },
  "runners": [
    {
      "name": "samsung_galaxy_s22",
      "type": "Android",
      ...
    }
  ]
}
```
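For the consumer side, a small hedged helper showing how HUD-side code might use the sentinel metric name and benchmark.extra_info.failure_type described above; field access mirrors the example records, and this is a sketch rather than the HUD implementation:

```python
def classify_record(record: dict) -> str:
    # Records whose metric name is the FAILURE_REPORT sentinel are failures.
    if record["metric"]["name"] != "FAILURE_REPORT":
        return "ok"
    # The failing level (GIT_JOB or DEVICE_JOB) lives in benchmark.extra_info.
    return record["benchmark"]["extra_info"].get("failure_type", "UNKNOWN")
```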