v2.1.0: fix MPI process suicide code #1418
Comments
I have verified this is working okay in master, but it was a problem in 2.x. The following PR fixes it:
Found a case where #997 doesn't seem to fix the problem:
@rhc54 is investigating.
Sadly, I am unable to replicate this behavior on either master or 2.x. I tried adding "sleep" instead of just having the proc spin, but it made no difference. I also tried inside and outside of a SLURM allocation - no difference. As soon as the daemon is killed, everything exits as it should.
Huh. Let me try again and see if I can be a bit more precise about the failure case.
In my case, the PMIx thread in the hung MPI processes seems to be stuck here:
That is, if I set a debugger breakpoint one line beyond the PMIx_Get at src/client/pmix_client_get.c:92, it never gets there.
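For reference, here is a minimal sketch (not the actual reproducer from this thread) of the kind of blocking PMIx_Get() call that hangs in this scenario; the PMIX_JOB_SIZE key and wildcard rank are only illustrative. The request round-trips to the local PMIx server, so if the daemon dies after PMIx_Init() succeeds and the lost-connection handling doesn't fire, the call waits forever on a reply that never arrives:

```c
/* Hypothetical sketch of a client blocking in PMIx_Get(); not the
 * actual test case from this issue. */
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, wildcard;
    pmix_value_t *val = NULL;
    pmix_status_t rc;

    if (PMIX_SUCCESS != (rc = PMIx_Init(&myproc, NULL, 0))) {
        fprintf(stderr, "PMIx_Init failed: %d\n", rc);
        return 1;
    }

    /* Ask for job-level data; this requires a reply from the local server */
    PMIX_PROC_CONSTRUCT(&wildcard);
    PMIX_PROC_LOAD(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);
    rc = PMIx_Get(&wildcard, PMIX_JOB_SIZE, NULL, 0, &val);

    /* With the bug, this point is never reached once the daemon has been
     * killed: PMIx_Get() simply never returns. */
    if (PMIX_SUCCESS == rc) {
        printf("job size = %u\n", val->data.uint32);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```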
@rhc54 This same problem does not appear to be happening on the v2.x branch. When I
@rhc54 Since it apparently isn't happening on the v2.x branch, I'm setting the milestone to v2.1 (on the assumption that there will be a re-sync of PMIx from master to v2.x after 2.0.x and before 2.1.x).
@jsquyres Can you please try this now? I think it may be (hopefully) fixed.
Agreed; this appears fixed now. Thanks.
Per discussions in Dallas, the "suicide" code in MPI (ORTE) processes isn't currently working. That is, if an MPI (ORTE) process loses connectivity to its local orted, it does not kill itself, which can lead to orphaned MPI processes.
Filing this ticket so that we can be sure it gets fixed for the v2.0.0 release.
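To illustrate the intended behavior, here is a hypothetical sketch using plain POSIX sockets; it is not the actual ORTE/OOB code, just the general pattern: when the read on the connection to the local orted reports EOF or a hard error, the process should terminate itself rather than keep running as an orphan.

```c
/* Hypothetical illustration of the intended "suicide" behavior; NOT the
 * actual ORTE implementation. The real logic lives in the OOB/RML layer. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>

/* Called from the event loop when the daemon socket becomes readable. */
static void daemon_socket_cb(int daemon_fd)
{
    char buf[512];
    ssize_t n = read(daemon_fd, buf, sizeof(buf));

    if (n > 0) {
        /* normal traffic from the orted - process it */
        return;
    }
    if (n < 0 && (errno == EAGAIN || errno == EINTR)) {
        /* transient condition - try again later */
        return;
    }

    /* EOF or hard error: the local orted is gone, so terminate ourselves
     * instead of becoming an orphaned MPI process. */
    fprintf(stderr, "lost connection to local daemon - terminating\n");
    exit(1);
}
```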