
Get rid of post-commit testing #14383


Closed · wants to merge 2 commits

Conversation

ldrumm (Contributor) commented Jul 2, 2024:

This is an experiment to try and find a good balance of reliable test infrastructure that can't be ignored during merge, tests enough, and runs quickly.

It's based on the following principle:
Pull request testing should ask and answer the question

If merged, would the proposed change represent a regression to the
project?

It should answer it to a high confidence, in as short a time as possible. It should not:

  • test things that are unrelated to the user's code
  • rely on dynamic data or unstable data beyond the author's control (e.g. last night's build results)
  • be obviously incomplete

With the above principles in mind, the jobs here are optimized to maximize diversity of coverage on the available runners, while also minimizing the turnaround time to know whether a merge request introduces a regression. There is a conflict here: high coverage relies on lots of hardware and takes lots of time, but since a long turnaround time discourages contribution we have to make a very fine judgment of what is reasonable. Here we do this by building once per native platform toolchain (with potentially different configurations to broaden build coverage), and afterwards running subsets of tests on each host's available adaptors.

Mac OS = Apple Clang + libc++; DSO-linkage release build + assertions
Win32 = MSVC; release build, no assertions
Linux = gcc + libstdc++; release build + assertions

This allows us to avoid building with both clang and gcc on a Linux platform - where the coverage gained is marginal (clang has slightly different diagnostics but otherwise doesn't really exercise the input code in a different manner to gcc) - but building with clang on Mac and gcc on Linux does carry a more significant chance of identifying a build regression. Since by default we enable -Werror in configuration, this allows us also to minimize the number of warnings merged, but still avoid doing multiple builds per platform.

In summary, we're not really interested in testing the system compiler, but instead ensuring that a reasonable matrix of configurations (static vs dynamic, noassert vs assert) does not regress.

We hopefully then answer the original question with reasonable confidence in correctness without the overhead of a brute-force combinations matrix.

We make the following assumptions:

  • the system compiler and standard library are sane. It's tempting to run the end-to-end tests with multiple host compilers, but divergence in results there most likely indicates a host compiler bug, or bad behaviour in the runtime, which would show up regardless of host codegen. So when the end-to-end tests fail, they fail because of the user's code, rather than acting as a test of the system compiler.

Dogfooding is a noble goal, but it's not suitable for merge request testing, since it introduces an instability for which the author of the PR is not responsible. Thus: we hoist dogfooding into the nightly testing which doesn't affect a user's chance of reasonably getting a patch merged before the heat death of the universe.
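The per-platform build matrix described above could be sketched as a GitHub Actions strategy matrix. This is an illustrative sketch only: the job layout, runner labels, and CMake flags are assumptions for illustration, not the actual intel/llvm workflow definitions.

```yaml
# Sketch only: runner labels and flags are assumptions, not the real workflow.
jobs:
  build:
    strategy:
      matrix:
        include:
          - name: Linux (gcc + libstdc++; release + assertions)
            os: ubuntu-22.04
            cmake_args: -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON
          - name: macOS (Apple Clang + libc++; DSO-linkage release + assertions)
            os: macos-latest
            cmake_args: -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DBUILD_SHARED_LIBS=ON
          - name: Windows (MSVC; release, no assertions)
            os: windows-latest
            cmake_args: -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=OFF
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      # One build per native platform toolchain; test subsets run afterwards.
      - run: cmake -B build ${{ matrix.cmake_args }}
      - run: cmake --build build
```

The point of the sketch is that each platform builds exactly once, with the configuration axes (assertions, DSO linkage) spread across platforms rather than multiplied within them.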

ldrumm (Contributor Author) commented Jul 2, 2024:

Thus: we hoist dogfooding into the nightly testing

Looks like I missed that step. TODO: Add a self-built run to the nightly

sarnex (Contributor) left a comment:

Initial comments, will do more in-depth review once the discussed changes are added

@@ -15,7 +15,14 @@ jobs:
with:
build_cache_root: "/__w/"
build_artifact_suffix: default
build_configure_extra_args: '--hip --cuda'
build_configure_extra_args:
Contributor:

Should we build precommit linux with the same flags (It looks like we don't but maybe I can't read)

Contributor Author:

I added this for more diversity, i.e. pre-commit is a static build (the llvm config default), and then nightly testing gets exercised with a shared config. I'm not married to it, but I figured it was an easy way to increase coverage.

Contributor:

One issue we actually do see relatively frequently is someone breaking the shared library build (usually a missing library dep in the cmake file), and today that is usually caught in post-commit because that does a shared-lib build. With this, we would only catch it in the nightly.

However I think it's reasonable to assume that most developers are using the default config (configure.py), and assuming we have someone guaranteed to be checking the nightly and to quickly report and/or fix issues, I don't have a huge problem with this.

Contributor:

We could swap it round, so pre-commit testing catches the shared-lib issues?

I wonder if it's possible to break static library build and not shared library build. If that's extremely unlikely such that either both are broken or shlib is broken, doing shlib in precommit would be fine with me.

Contributor:

We probably should just drop the "sharedlib" build configuration. The primary use case for this configuration is to speed up incremental builds of LLVM libraries. This build configuration is not used in production environments - it's used only by LLVM developers. If everyone uses the standard build, there is no point in building the "sharedlib" configuration in CI.

Contributor:

Personally I use it rather frequently and I would be annoyed if it was failing and not caught in CI.

aelovikov-intel (Contributor) commented Jul 3, 2024:

Our downstream uses it a lot. It would be very bad if the issues were only discovered when syncing downstream with upstream and not in upstream CI.

- name: Perf tests on Intel Arc A-Series Graphics system
runner: '["Linux", "arc"]'
env: '{"LIT_FILTER":"PerformanceTests/"}'
extra_lit_opts: -a -j 1 --param enable-perf-tests=True
Contributor:

I think we need --param gpu-intel-dg2=True

Contributor Author:

Gotcha. Will fix
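With the suggested parameter added, the job entry quoted above would presumably read as follows (a sketch based on the quoted snippet and the reviewer's suggestion, not the final committed version):

```yaml
# Sketch: the quoted perf-test entry with the suggested dg2 param appended.
- name: Perf tests on Intel Arc A-Series Graphics system
  runner: '["Linux", "arc"]'
  env: '{"LIT_FILTER":"PerformanceTests/"}'
  extra_lit_opts: -a -j 1 --param enable-perf-tests=True --param gpu-intel-dg2=True
```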

- name: E2E on Intel GEN12 Graphics; OpenCL GPU, OpenCL FPGA
runner: '["Linux", "gen12"]'
extra_lit_opts: --param gpu-intel-gen12=True
target_devices: opencl:gpu;opencl:fpga
Contributor:

This might not be an issue introduced by this PR and may already exist, but I think having gpu-intel-gen12=True and multiple target devices means the variable is true for all target devices (unless we somehow set it to false in the lit python scripts when it's not a gpu). Probably we only want that variable to be true for GPU targets.

@aelovikov-intel Any idea?
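One way to keep the GPU-specific param from applying to the FPGA target would be to split the quoted entry per target device. This is a sketch of that idea only, not a change proposed in the PR:

```yaml
# Sketch: split the combined entry so gpu-intel-gen12 applies only to the GPU run.
- name: E2E on Intel GEN12 Graphics; OpenCL GPU
  runner: '["Linux", "gen12"]'
  extra_lit_opts: --param gpu-intel-gen12=True
  target_devices: opencl:gpu

- name: E2E on Intel GEN12 Graphics; OpenCL FPGA
  runner: '["Linux", "gen12"]'
  target_devices: opencl:fpga
```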

Contributor:

There is a parallel activity to auto-detect architecture. I think that would solve the issue, but I don't remember the status of it.

Contributor:

It should be already available: #13976

sarnex (Contributor) commented Jul 2, 2024:

Cool, thanks! Probably we need to either map the results there to the gpu-intel-* we have been using, or the other way around. Having two ways to specify the gpu seems bad.

Contributor:

Cool, thanks! Probably we need to either map the results there to the gpu-intel-* we have been using, or the other way around. Having two ways to specify the gpu seems bad.

That's the "in progress" part that I'm unaware about the status of. @AlexeySachkov , someone from your team is looking into it, right?

@@ -80,6 +87,37 @@ jobs:
image_options: -u 1001 --device=/dev/dri --privileged --cap-add SYS_ADMIN
target_devices: opencl:cpu
tests_selector: cts

- name: E2E tests with dev igc on Intel Arc A-Series Graphics
Contributor:

Please update your commit message with the summary of changes related to Arc GPU testing. @sarnex and @YuriPlyakhin would have to approve the changes to the devigc usage.

sarnex (Contributor) commented Jul 2, 2024:

It seems as part of this PR we stop doing arc dev igc testing in precommit and only do it in the nightly. This is fine with me but we need @jsji's approval, he worked on the dev igc CI and is more familiar with the use case.

Contributor:

@ldrumm , why do you suggest to remove dev igc testing from precommit?

Contributor:

I think the time spent running the dev igc testing is fairly small; we should try to keep it in pre-commit if possible -- moving it from pre-commit to post-commit (nightly) would make things complicated unless we can identify the culprit commit and ping the author automatically in post-commit (nightly).

Contributor:

+1 to what @jsji said

Contributor Author:

The time spent testing the dev driver might be small, but it introduces an instability. The principle here is to have a stable testing platform where the user's PR is tested against a stable config, not where a user's PR is tested against unstable GPU drivers. If we're to have any unstable test platforms available in pre-commit, they should be explicitly opt-in only.
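An opt-in gate of this kind already appears elsewhere in this PR's diff, keying dev-IGC usage off a path filter. A minimal sketch of that pattern follows; the job name is hypothetical, but the gating expression is quoted from the diff shown later in this conversation:

```yaml
# Hypothetical job name; the `contains(...)` expression is from this PR's diff.
# The dev-IGC job runs only when the PR touches the dev-IGC configuration,
# so the unstable driver is opt-in rather than part of every pre-commit run.
test_dev_igc:
  if: ${{ contains(needs.detect_changes.outputs.filters, 'devigccfg') }}
```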

Contributor Author:

It basically boils down to "GPU driver developers should do their own testing, not experiment on dpc++ contributors"

Contributor:

Thanks @ldrumm, I understand the principle. However, the dev igc testing is not trying to test the GPU driver, but rather doing continuous integration tests for the Joint Matrix/ESIMD features, which require IGC changes from time to time. The testing is already restricted to Joint Matrix and ESIMD only.

Contributor:

Also, @ldrumm, the dev igc driver is only updated after the Joint Matrix/ESIMD E2E tests have passed with it, so it is stable in the context of pre-commit testing.

@@ -14,7 +14,7 @@ on:
build_image:
type: string
required: false
default: "ghcr.io/intel/llvm/ubuntu2204_build:latest"
default: "ghcr.io/intel/llvm/ubuntu2204_build:latest-0300ac924620a51f76c4929794637b82790f12ab"
Contributor:

Why this? At the very least, it should have a comment in the code explaining it.

Contributor Author:

This is the same as the other places that use this tag.

 $ ag 0300ac92462 -A 1 -B 1
workflows/sycl-linux-build.yml
16-        required: false
17:        default: "ghcr.io/intel/llvm/ubuntu2204_build:latest-0300ac924620a51f76c4929794637b82790f12ab"
18-      build_ref:

workflows/sycl-precommit.yml
81-            runner: '["Linux", "amdgpu"]'
82:            image: ghcr.io/intel/llvm/ubuntu2204_build:latest-0300ac924620a51f76c4929794637b82790f12ab
83-            image_options: -u 1001 --device=/dev/dri --device=/dev/kfd

workflows/sycl-linux-precommit-aws.yml
66-      runner: '["aws_cuda-${{ github.event.workflow_run.id }}-${{ github.event.workflow_run.run_attempt }}"]'
67:      image: ghcr.io/intel/llvm/ubuntu2204_build:latest-0300ac924620a51f76c4929794637b82790f12ab
68-      image_options: -u 1001 --gpus all --cap-add SYS_ADMIN --env NVIDIA_DISABLE_REQUIRE=1

What should the comment say? "Use stable tag 0300ac9"? Seems redundant to me
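For what it's worth, the kind of comment being requested might look like the following; the wording is a suggestion only, and the tag value is quoted from the diff above:

```yaml
# Pin the build image to a known-good tag so pre-commit results are
# reproducible and not affected by unrelated updates to 'latest'.
default: "ghcr.io/intel/llvm/ubuntu2204_build:latest-0300ac924620a51f76c4929794637b82790f12ab"
```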

use_dev_igc: ${{ contains(needs.detect_changes.outputs.filters, 'devigccfg') }}
extra_lit_opts: --param matrix-xmx8=True --param gpu-intel-dg2=True

# Performance tests below. Specifics:
Contributor:

These tests are very dumb for now, and the idea behind them is to be able to access the logs to look through performance numbers on a per-commit basis. I don't see a reason to move this into the nightly...

Contributor Author:

Sure. I'll move them back

sycl_toolchain_archive: ${{ needs.build_linux.outputs.artifact_archive_name }}
sycl_toolchain_decompress_command: ${{ needs.build_linux.outputs.artifact_decompress_command }}

build_win:
Contributor:

Why are you removing the split between lin/win tasks? I would have expected you to link the commit that introduced the split in the PR's description together with your arguments on why you think that was wrong.

Also, you're not removing sycl-windows-precommit.yml, why?

Contributor:

Also, even if we decide to keep that change, it should be done in a separate PR and not here.

Contributor Author:

Why are you removing the split between lin/win tasks? I would have expected you to link the commit that introduced the split in the PR's description together with your arguments on why you think that was wrong.

So people can see all precommit tests together. It never occurred to me that they might have been together at one point and then separated out, so I didn't consider that something had been done "wrong" and saw no need to address it.

Also, you're not removing sycl-windows-precommit.yml, why?

That's a mistake. Will fix.

Also, even if we decide to keep that change, it should be done in a separate PR and not here.

Which change? Why does it need a separate merge request?

Contributor:

Which change?

Merging win/lin into a single YML, obviously...

Why does it need a separate merge request?

Because that would be an atomic change and per good software development practices PRs have to represent atomic changes, not a bunch of different fixes at once.

It never occurred to me that they might have been together at one point and then separated out

They were. If you decide to merge them back, please make it clear in your new atomic PR why the original reasoning no longer applies.

Contributor Author:

obviously

Not obviously. Don't patronise me. Your description was unclear and I'm not a mind reader

Contributor:

My first comment in this thread was attributed to a single change:

Why are you removing the split between lin/win tasks?

The second (and the only other comment at that time) in this thread was

if we decide to keep that change...

What else could it possibly refer to, in your opinion?

extra_lit_opts: --param gpu-intel-gen12=True
target_devices: opencl:gpu;opencl:fpga

- name: E2E on Intel Arc A-Series Graphics; Level zero GPU, OpenCL CPU
Contributor:

Suggested change:
- - name: E2E on Intel Arc A-Series Graphics; Level zero GPU, OpenCL CPU
+ - name: E2E on Intel Arc A-Series Graphics; Level zero GPU, OpenCL GPU

Contributor Author:

I'm confused. Isn't this FPGA and not GPU?

YuriPlyakhin (Contributor) commented Jul 3, 2024:

Originally on ARC we tested with 2 GPU backends: level 0 and OpenCL: target_devices: ext_oneapi_level_zero:gpu;opencl:gpu
Not sure where FPGA or CPU is coming from...
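Per that original configuration, the Arc entry would presumably look something like this. The target_devices value is quoted from the comment above and the runner label from the perf-test snippet earlier in this conversation; the rest is a sketch:

```yaml
# Sketch of the Arc entry with the two GPU backends originally tested.
- name: E2E on Intel Arc A-Series Graphics; Level Zero GPU, OpenCL GPU
  runner: '["Linux", "arc"]'
  target_devices: ext_oneapi_level_zero:gpu;opencl:gpu
```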

@@ -98,23 +105,16 @@ jobs:
install_drivers: ${{ contains(needs.detect_changes.outputs.filters, 'drivers') }}
extra_lit_opts: --param matrix-xmx8=True --param gpu-intel-dg2=True
env: '{"LIT_FILTER":${{ needs.determine_arc_tests.outputs.arc_tests }} }'
- name: E2E tests with dev igc on Intel Arc A-Series Graphics

- name: E2E on Intel Arc A-Series Graphics
YuriPlyakhin (Contributor) commented Jul 2, 2024:

Should the name be more specific about what targets are being tested, for example: OpenCL FPGA, OpenCL GPU, etc.?

Contributor Author:

Probably, I'll have a look

extra_lit_opts: --param matrix-xmx8=True --param gpu-intel-dg2=True
env: '{"LIT_FILTER":${{ needs.determine_arc_tests.outputs.arc_tests }} }'
target_devices: opencl:gpu;opencl:opencl:fpga
Contributor:

OpenCL GPU is being tested by the previous "E2E on Intel Arc A-Series Graphics" run, right? If so, we probably should not repeat it here.

Also, what is the point of testing OpenCL FPGA on hardware with an Intel Arc? I would use a more powerful system for that, and a GPU is not necessary for it.

Contributor:

+1 on this, I think FPGA testing on any system is the same; it just needs to be powerful enough to not time out, etc.

@bader bader removed their assignment Jul 2, 2024
@ldrumm ldrumm temporarily deployed to WindowsCILock July 3, 2024 11:14 — with GitHub Actions Inactive
runner: '["Linux", "amdgpu"]'
image: ghcr.io/intel/llvm/ubuntu2204_build:latest-0300ac924620a51f76c4929794637b82790f12ab
image_options: -u 1001 --device=/dev/dri --device=/dev/kfd
target_devices: ext_oneapi_hip:gpu
- name: Intel

- name: E2E on Intel GEN12; level zero GPU, OpenCL CPU
Contributor:

Everywhere in the code: Level Zero - both words should start with capital letters

ldrumm (Contributor Author) commented Jul 3, 2024 via email


This pull request is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.

@github-actions github-actions bot added the Stale label Dec 31, 2024

This pull request was closed because it has been stalled for 30 days with no activity.

@github-actions github-actions bot closed this Jan 31, 2025