There are recurring instabilities on the AMD pre-commit runs. Every time
they fail, two things happen:
* One or more tests fail with a memory access fault
* One or more tests hang and end up timing out
This seemingly only happens when running the pre-built E2E tests in
parallel.
It is quite difficult to debug and could potentially be an issue in the
AMD drivers.
So as a workaround until we can figure out what's going on, this patch
switches the AMD pre-built E2E tests to run in a single thread.
This is obviously slower than running the tests in parallel, but because
the instability causes hangs that end up hitting the 10-minute timeout,
a single-thread run is faster than a failing multi-thread run. So we get
consistent runs that are slower but may actually end up going through
the job queue faster as they won't be hitting timeouts so often.
On a local setup using the same AMD GPU as the CI:
* Successful multi-thread run: ~73s
* Successful single-thread run: ~255s
* Failed multi-thread run: 600s+
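
As a rough sanity check on the trade-off, the timings above can be turned
into a break-even failure rate. The sketch below is only an illustration
under stated assumptions: a failed multi-thread run is taken to cost
exactly 600s (the measurement is "600s+", so this is a lower bound),
retries and queueing effects are ignored, and the function names are
hypothetical. Under those assumptions, single-thread runs come out ahead
on average once roughly 35% or more of multi-thread runs fail.

```python
# Local timings quoted in this commit message, in seconds.
MULTI_OK = 73.0     # successful multi-thread run
SINGLE_OK = 255.0   # successful single-thread run
MULTI_FAIL = 600.0  # failed multi-thread run; ASSUMED exact, really "600s+"


def expected_multi_thread_time(failure_rate: float) -> float:
    """Average multi-thread run time for a given (assumed) failure rate."""
    return (1.0 - failure_rate) * MULTI_OK + failure_rate * MULTI_FAIL


def break_even_failure_rate() -> float:
    """Failure rate at which the average multi-thread run takes as long
    as a single-thread run (solves expected_multi_thread_time(p) == SINGLE_OK)."""
    return (SINGLE_OK - MULTI_OK) / (MULTI_FAIL - MULTI_OK)


if __name__ == "__main__":
    p = break_even_failure_rate()
    print(f"break-even failure rate: {p:.1%}")
    print(f"expected multi-thread time at that rate: "
          f"{expected_multi_thread_time(p):.0f}s")
```

The actual failure rate of the CI runs is not measured here, so this only
shows where the crossover would be, not which side of it the CI is on.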