Skip to content

[LIT][E2E] Fix LIT hang when executing non-existent files #17505

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Apr 9, 2025

Conversation

ayylol
Copy link
Contributor

@ayylol ayylol commented Mar 18, 2025

Fixes #16351

In our containers if an exception is not immediately caught lit will hang on the following call

pool.join()

This can occur when using the internal lit shell and trying to run a program that does not exist. In this case _executeShCmd will throw an internal shell error, which will not be caught by the function directly calling it, executeShCmd, rather it is caught one function higher in the call stack in executeScriptInternal. Because that exception is percolated up the call stack instead of being immediately caught lit will hang. This patch changes the location where we catch this exception to executeShCmd instead to avoid this.

Previously to avoid this we would use the external lit shell. However this introduces some differences in how we need to write tests (i.e., needing to add --crash for certain not calls), it slightly changes how test output is printed (all in one block, rather than separated by RUN: lines), and it messes up the path to executables when running on Windows (all \ were interpreted as escapes for the next characters leading to trying to execute a non-existent file). This pr also changes the E2E tests to always use the internal lit shell.

For more background on what causes this hang see:
https://stackoverflow.com/questions/15314189/python-multiprocessing-pool-hangs-at-join
https://bugs.python.org/issue9400
python/cpython#53646

@ayylol ayylol temporarily deployed to WindowsCILock March 18, 2025 15:22 — with GitHub Actions Inactive
@ayylol
Copy link
Contributor Author

ayylol commented Mar 18, 2025

Test run with dummy test that would previously hang when using the internal shell:
https://github.com/intel/llvm/actions/runs/13925966650

On PVC, and multiple other jobs running took less than the timeout. Meaning the hang is fixed.

Failed Tests (1):
  SYCL :: dummy_test.cpp


Testing Time: 181.56s

@ayylol
Copy link
Contributor Author

ayylol commented Mar 19, 2025

Note: This fix is in upstream llvm-lit code so I've opened llvm/llvm-project#131881 as well with the same changes to TestRunner.py. However I was only able to reproduce this in our own containers. Also, these changes are necessary to enable the E2E test splitting on windows (#17335), since running the tests with the external shell messes up the paths of the executables on windows.

@ayylol ayylol marked this pull request as ready for review March 19, 2025 15:39
@ayylol ayylol requested review from a team as code owners March 19, 2025 15:39
@ayylol ayylol requested a review from slawekptak March 19, 2025 15:39
@ayylol ayylol changed the title [TESTING][SYCL][E2E] Fix LIT hang when executing non-existent files [SYCL][E2E] Fix LIT hang when executing non-existent files Mar 19, 2025
@ayylol ayylol requested a review from aelovikov-intel March 19, 2025 15:40
Copy link
Contributor

@sarnex sarnex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix seems fine even if its just locally

@ayylol ayylol changed the title [SYCL][E2E] Fix LIT hang when executing non-existent files [LIT][E2E] Fix LIT hang when executing non-existent files Mar 19, 2025
@ayylol ayylol temporarily deployed to WindowsCILock April 7, 2025 19:47 — with GitHub Actions Inactive
@ayylol ayylol temporarily deployed to WindowsCILock April 7, 2025 20:14 — with GitHub Actions Inactive
@ayylol ayylol temporarily deployed to WindowsCILock April 7, 2025 20:14 — with GitHub Actions Inactive
@ayylol ayylol temporarily deployed to WindowsCILock April 8, 2025 13:33 — with GitHub Actions Inactive
@ayylol ayylol temporarily deployed to WindowsCILock April 8, 2025 14:02 — with GitHub Actions Inactive
@ayylol ayylol temporarily deployed to WindowsCILock April 8, 2025 14:02 — with GitHub Actions Inactive
@ayylol
Copy link
Contributor Author

ayylol commented Apr 8, 2025

@intel/llvm-gatekeepers this is ready to merge.

Note, these changes correspond to upstream LLVM pr llvm/llvm-project#131881, but merging here first will unblock progress on #17335.

Once the upstream pr is merged I will cherry-pick those changes into intel/llvm.

@kbenzie kbenzie merged commit e8d1e30 into intel:sycl Apr 9, 2025
24 checks passed
@ayylol ayylol deleted the lit-hang-fix branch May 6, 2025 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SYCL] Lit hang when executing program that does not exist
4 participants