[CI] Tune nightly benchmarking job for better reliability #17122

Merged 25 commits on Mar 13, 2025
Commits
8f45038 Add PVC_PERF runner as sole option (ianayl, Feb 21, 2025)
0a8b0d2 Turn down fail threshold (ianayl, Feb 21, 2025)
e672cda Restrict number of cores used (ianayl, Feb 28, 2025)
684f94e Fix missing shell directive (ianayl, Feb 28, 2025)
57e530e Tweak thresholds (ianayl, Feb 28, 2025)
aaa19f0 Bump up tolerance (ianayl, Feb 28, 2025)
bdf68d3 Bump iterations for more consistent results (ianayl, Feb 28, 2025)
b495154 Bump iterations (ianayl, Feb 28, 2025)
384c4d6 Reduce number of iterations (ianayl, Feb 28, 2025)
dd3f861 Lower number of iterations (ianayl, Feb 28, 2025)
0ee8a9a Temporarily lower min_threshold to test results (ianayl, Feb 28, 2025)
5cab659 Require more samples before comparison (ianayl, Mar 3, 2025)
e556368 Tweak tolerance and add check for intel/llvm in nightly (ianayl, Mar 5, 2025)
84eda44 Increase iterations to 5000 again (ianayl, Mar 5, 2025)
565bfbd Reduce tolerance to 5% (ianayl, Mar 5, 2025)
247ed16 Merge branch 'sycl' of https://github.com/intel/llvm into ianayl/tune… (ianayl, Mar 5, 2025)
04d5220 Bump tolerance to 7% (ianayl, Mar 5, 2025)
7637810 Bump tolerance up to 8% (ianayl, Mar 6, 2025)
98a9b3d Test 10000 iterations (ianayl, Mar 6, 2025)
13f86ec Revert "Test 10000 iterations" (ianayl, Mar 6, 2025)
fedb018 Do not reset GPU (ianayl, Mar 10, 2025)
26180cd Readd back /dev/dri/by-path to docker image arguments (ianayl, Mar 12, 2025)
302d6b0 Re-enable resetting the GPU (ianayl, Mar 12, 2025)
fdfbb9a Switch out manual triggers for installing igc drivers for resetting i… (ianayl, Mar 12, 2025)
08fb37a Fix capitalization (ianayl, Mar 12, 2025)
9 changes: 8 additions & 1 deletion .github/workflows/sycl-linux-run-tests.yml
@@ -203,7 +203,14 @@ jobs:
     env: ${{ fromJSON(inputs.env) }}
     steps:
     - name: Reset Intel GPU
-      if: inputs.reset_intel_gpu == 'true'
+      if: >
+        ${{ inputs.reset_intel_gpu == 'true' ||
+        ( github.event_name == 'workflow_dispatch' &&
+        inputs.tests_selector == 'compute-benchmarks' &&
+        inputs.runner == '["PVC_PERF"]' ) }}
+      # Specifically, manual dispatch running compute-benchmarks on PVC_PERF
+      # should reset intel GPU, since manual dispatch does not provide an option
+      # for users to pick whether or not manual dispatch resets the GPU.
       run: |
         sudo mount -t debugfs none /sys/kernel/debug
         base_dir="/sys/kernel/debug/dri"
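
The decision encoded by the new `if:` expression can be sketched as a small shell function. This is only an illustration of the logic, not code from the PR: the function name `should_reset_gpu` and its positional arguments are hypothetical, and the values are compared as plain strings, as the workflow expression does.

```shell
#!/bin/sh
# Hypothetical helper mirroring the workflow's `if:` expression.
# Arguments (all strings):
#   $1 reset_intel_gpu  $2 event name  $3 tests_selector  $4 runner (JSON text)
# Returns 0 when the GPU reset step should run, 1 otherwise.
should_reset_gpu() {
    reset_intel_gpu=$1
    event_name=$2
    tests_selector=$3
    runner=$4

    # Explicit opt-in from the calling workflow.
    if [ "$reset_intel_gpu" = "true" ]; then
        return 0
    fi

    # Manual dispatch of compute-benchmarks on PVC_PERF always resets,
    # because manual dispatch offers no input to request a reset.
    if [ "$event_name" = "workflow_dispatch" ] &&
       [ "$tests_selector" = "compute-benchmarks" ] &&
       [ "$runner" = '["PVC_PERF"]' ]; then
        return 0
    fi

    return 1
}
```

For example, `should_reset_gpu false workflow_dispatch compute-benchmarks '["PVC_PERF"]' && echo reset` prints `reset`, while the same call with a different `tests_selector` prints nothing.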
2 changes: 1 addition & 1 deletion .github/workflows/sycl-nightly.yml
@@ -266,7 +266,7 @@ jobs:
           runner: '["PVC_PERF"]'

Contributor:
IMO we should include the OS here

Contributor Author:
We'll need to add a "Linux" tag to the UR runner then, although it's worth noting that the UR folks would prefer not to have other jobs run on their runners, so we'll need to make sure whatever tags we add to that runner do not result in other workflows picking it up.

For now, I'll add "Linux" somewhere in the name.

Contributor @sarnex, Feb 27, 2025:
Oh wow, the runner doesn't have the Linux tag automatically, that's weird.

Contributor Author:
I think that was intentional, though; I would check with @pbalcer.

Contributor @pbalcer, Feb 28, 2025:
@lukaszstolarczuk was the one who set it up ;-) I'm not an expert on runners.

But yeah, the original idea behind this system was that it'd just have one runner script instance, used exclusively for the benchmarks. But right now this system isn't very busy, so...

Contributor:
I explicitly used --no-default-labels when setting this runner up. I guess you can update the labels to whatever you wish; please just remember to update ur-benchmarks-reusable.yml afterwards.

On a general note, as Piotr said, we wanted this runner to be used mostly for performance, as it's also used for measuring UMF perf and too much traffic may influence the results.
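
For readers unfamiliar with the flag mentioned above, here is a minimal sketch of what such a registration could look like. It only prints the command rather than running it. The `--url`, `--token`, `--labels` and `--no-default-labels` options are those of the GitHub Actions runner's `config.sh`; the function name, the `REPO_URL` and `RUNNER_TOKEN` placeholders, and the choice of `PVC_PERF` as the only label are assumptions for illustration, not details confirmed in this thread.

```shell
#!/bin/sh
# Sketch only: prints a registration command instead of executing it.
# REPO_URL, RUNNER_TOKEN and the label value are placeholders (assumptions).
REPO_URL="${REPO_URL:-https://github.com/OWNER/REPO}"
RUNNER_TOKEN="${RUNNER_TOKEN:-<registration-token>}"

print_register_cmd() {
    # --no-default-labels skips the automatic self-hosted/OS/architecture
    # labels, so only the labels passed via --labels are attached and only
    # workflows that name them in runs-on can pick this runner up.
    printf '%s\n' "./config.sh --url $REPO_URL --token $RUNNER_TOKEN --no-default-labels --labels $1"
}

print_register_cmd PVC_PERF
```

With no default labels, a job that asks only for a generic label such as Linux would not land on this machine, which matches the stated goal of keeping the runner reserved for benchmarking.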

Contributor:
> we wanted this runner to be used mostly for performance, as it's also used for measuring UMF perf and too much traffic may influence the results.

+1 Yeah, I also wanted to reserve this runner for performance benchmarking exclusively. @lukaszstolarczuk there's only one GHA process running on this runner, right? If so, we can be sure that only one benchmarking job will be executing at a given time.

Contributor:
At the moment it is not a single runner. We want to move UMF into the intel org; only then can we have a single, shared runner. For now we have 2 runners (one for SYCL, one for UMF), each bound to a different NUMA node.

           image_options: -u 1001 --device=/dev/dri -v /dev/dri/by-path:/dev/dri/by-path --privileged --cap-add SYS_ADMIN
           target_devices: level_zero:gpu
-          reset_intel_gpu: false
+          reset_intel_gpu: true
     uses: ./.github/workflows/sycl-linux-run-tests.yml
     secrets: inherit
     with: