[Benchmark] Generate benchmark record for job failure #9247
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9247
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit 69c42fa with merge base dd9a85a.
BROKEN TRUNK - The following job failed but was already present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```
    A job can fail at two levels: GIT_JOB and DEVICE_JOB. If any job fails,
    generate a failure benchmark record.
    """
    artifacts = content.get("artifacts")
    git_job_name = content["git_job_name"]
```
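For orientation, here is a hedged sketch of how the two failure levels could map to a record; the helper names (`generate_failure_record`, `extract_model_info`) and the exact payload are assumptions modeled on the examples in the PR description below, not the actual implementation:

```python
import json

# Sketch only: emit a FAILURE_REPORT record for either failure level.
# GIT_JOB failures carry no job report; DEVICE_JOB failures do.
def generate_failure_record(content: dict) -> dict:
    git_job_name = content["git_job_name"]
    job_report = content.get("job_report")  # None for GIT_JOB-level failures
    failure_type = "DEVICE_JOB" if job_report else "GIT_JOB"
    model_name, backend = extract_model_info(git_job_name)  # hypothetical parser
    return {
        "benchmark": {
            "name": "ExecuTorch",
            "mode": "inference",
            "extra_info": {
                "job_conclusion": "FAILURE",
                "failure_type": failure_type,
                "job_report": json.dumps(job_report or {}),
            },
        },
        "model": {"name": model_name, "type": "OSS model", "backend": backend},
        "metric": {"name": "FAILURE_REPORT", "benchmark_values": 0, "target_value": 0},
    }
```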
A word of caution when trying to extract information about the run from the job name. The job name comes from this line: https://github.com/pytorch/executorch/blob/main/.ci/scripts/gather_benchmark_configs.py#L335. So, I think:
- Make the error raised when failing to parse the job name in `extract_model_info` clearer by referring to the `gather_benchmark_configs` script (see the sketch below). Most likely, it has been updated without updating `extract_benchmark_results`.
- Add a comment on both scripts that they need to be kept in sync.
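A minimal sketch of what such a clearer error could look like; the regex and the job-name format it assumes are illustrative, not the actual implementation:

```python
import re

# Illustrative only: the real job-name format is produced by
# .ci/scripts/gather_benchmark_configs.py and must stay in sync with this parser.
JOB_NAME_RE = re.compile(r"benchmark-on-device \((?P<model>[^,]+), (?P<backend>[^,]+),")

def extract_model_info(git_job_name: str) -> tuple[str, str]:
    m = JOB_NAME_RE.search(git_job_name)
    if not m:
        raise ValueError(
            f"Unable to parse model info from job name {git_job_name!r}. "
            "The job name is generated by .ci/scripts/gather_benchmark_configs.py; "
            "it was likely changed without updating extract_benchmark_results."
        )
    return m.group("model"), m.group("backend")
```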
sounds good!
I also raise an exception for `get_app_type` and `get_device_os_type`.
Added unit tests for those cases too.
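For illustration, a hedged sketch of what raising in those helpers could look like; the accepted suffixes and prefixes are assumptions based on the `app_type` and OS values shown in the examples below:

```python
def get_app_type(app_path: str) -> str:
    # Assumed mapping, based on the IOS_APP value seen in the example records.
    if app_path.endswith(".ipa"):
        return "IOS_APP"
    if app_path.endswith(".apk"):
        return "ANDROID_APP"
    raise ValueError(f"Unknown app type for artifact {app_path!r}")

def get_device_os_type(device_os: str) -> str:
    # Assumed prefixes, matching the "Android"/"iOS" runner types below.
    if device_os.startswith("iOS"):
        return "iOS"
    if device_os.startswith("Android"):
        return "Android"
    raise ValueError(f"Unknown device OS {device_os!r}")
```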
@huydhn
not urgent: maybe we can add a unit test to check that this works as expected. We could create a common lib to share some configs between this script and the other one.
I also added a comment in perf.yml for the job-name step change.
If this script is used outside of those yml files, we can make the regex prefix more flexible to let the user pass the step name for checking.
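A minimal unit-test sketch along those lines, assuming the hypothetical `extract_model_info` and job-name format from the earlier comment:

```python
import unittest

class TestExtractModelInfo(unittest.TestCase):
    def test_parses_model_and_backend(self):
        # Job name format assumed to mirror gather_benchmark_configs.py output.
        name = "benchmark-on-device (ic4, mps, apple_iphone_15, ...)"
        model, backend = extract_model_info(name)
        self.assertEqual((model, backend), ("ic4", "mps"))

    def test_unparseable_name_raises(self):
        with self.assertRaises(ValueError):
            extract_model_info("some-unrelated-job-name")

if __name__ == "__main__":
    unittest.main()
```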
While checking the results from your branch on the dashboard, I noticed a curious issue where older commits from your branch have newer timestamps. It seems like an issue for later.
LGTM! Let's take an action item to watch for the next failure from these workflows after this lands, to double-check the results on the dashboard before resolving the issue.
interesting, created an issue here: pytorch/test-infra#6427
sounds good, I also need to update the UI
# Description
Compose a failure benchmark record.

Issue: pytorch/test-infra#6294

# Related Query File
https://github.com/pytorch/test-infra/blob/main/torchci/clickhouse_queries/oss_ci_benchmark_llms/query.sql

# Details
When a job fails at the git_job level, or a device fails during the benchmark test, we return a benchmark record to indicate the failure, so that the HUD UI can properly distinguish metrics that were not run from metrics that ran with failures.

This PR may temporarily introduce `Unknown` in some fields of the HUD execubench table; this will be fixed in the HUD UI by handling the special benchmark value.

For both levels of failure, the metric name will be "FAILURE_REPORT". In HUD, we mainly use this special metric name to identify failures; if more information is needed, the level at which the job failed is recorded in benchmark.extra_info.

## Step Failure
When a failure is detected, we try to extract model info from git_job_name; the step will fail if the model info cannot be extracted.

# Example of a failure benchmark record
## When a job fails at the device-job level
- device_name: taken from job_report.name, for instance `iPhone 15`
- device_os: job_report.os with prefix "Android" or "iOS"; this should match both Android and iOS settings
- model.name: extracted from the git job name
- model.backend: extracted from the git job name
- metric.name: "FAILURE_REPORT"

```
{
  "benchmark": {
    "name": "ExecuTorch",
    "mode": "inference",
    "extra_info": {
      "app_type": "IOS_APP",
      "job_conclusion": "FAILED",
      "failure_type": "DEVICE_JOB",
      "job_report": "..."
    }
  },
  "model": {
    "name": "ic4",
    "type": "OSS model",
    "backend": "mps"
  },
  "metric": {
    "name": "FAILURE_REPORT",
    "benchmark_values": 0,
    "target_value": 0,
    "extra_info": {
      "method": ""
    }
  },
  "runners": [
    {
      "name": "iPhone 15",
      "type": "iOS 18.0"
    }
  ]
}
```

## When a job fails at the git-job level (there are no job_reports)
This happens when a job fails before it runs the benchmark job.
- device_name: device_pool_name from the git job name, for example `samsung_galaxy_s22`
- device_os: "Android" or "iOS"
- model.name: extracted from the git job name
- model.backend: extracted from the git job name
- metric.name: "FAILURE_REPORT"

The failure benchmark record looks like:

```
{
  "benchmark": {
    "name": "ExecuTorch",
    "mode": "inference",
    "extra_info": {
      "app_type": "IOS_APP",
      "job_conclusion": "FAILURE",
      "failure_type": "GIT_JOB",
      "job_report": "{}"
    }
  },
  "model": {
    "name": "ic4",
    "type": "OSS model",
    "backend": "mps"
  },
  "metric": {
    "name": "FAILURE_REPORT",
    ...
  },
  "runners": [
    {
      "name": "samsung_galaxy_s22",
      "type": "Android",
      ...
    }
  ]
}
```
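For the consumer side, a small hedged helper showing how HUD-side code might use the sentinel metric name and benchmark.extra_info.failure_type described above; field access mirrors the example records, and this is a sketch rather than the HUD implementation:

```python
def classify_record(record: dict) -> str:
    # Records whose metric name is the FAILURE_REPORT sentinel are failures.
    if record["metric"]["name"] != "FAILURE_REPORT":
        return "ok"
    # The failing level (GIT_JOB or DEVICE_JOB) lives in benchmark.extra_info.
    return record["benchmark"]["extra_info"].get("failure_type", "UNKNOWN")
```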