v2.1.0: fix MPI process suicide code #1418

Closed
jsquyres opened this issue Mar 2, 2016 · 9 comments

Comments

@jsquyres
Member

jsquyres commented Mar 2, 2016

Per discussions in Dallas, the "suicide" code in MPI (ORTE) processes isn't currently working. I.e., if an MPI (ORTE) process loses connectivity to its local orted, it's not killing itself. This can lead to orphaned MPI processes.

Filing this ticket so that we can be sure it gets fixed for the v2.0.0 release.
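
For context, the intended "suicide" behavior boils down to a pattern like the one below: the process watches its connection to the local orted and exits as soon as that connection goes away. This is a minimal, hypothetical C sketch (assumed to run on a dedicated progress thread, with daemon_fd an already-connected socket to the local orted), not the actual ORTE/PMIx code:

/* Minimal, hypothetical sketch of the intended behavior (NOT the actual
 * ORTE/PMIx code): watch the socket to the local orted and exit as soon as
 * the connection drops, so the process cannot be left orphaned.
 * daemon_fd is assumed to be a connected socket to the local orted. */
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>

void watch_daemon(int daemon_fd)
{
    struct pollfd pfd = { .fd = daemon_fd, .events = POLLIN };
    char buf[4096];

    for (;;) {
        if (poll(&pfd, 1, -1) < 0) {
            continue;   /* interrupted; retry */
        }
        if ((pfd.revents & (POLLHUP | POLLERR)) ||
            ((pfd.revents & POLLIN) &&
             recv(daemon_fd, buf, sizeof(buf), 0) <= 0)) {
            fprintf(stderr, "lost connection to local orted; exiting\n");
            exit(1);    /* "suicide": do not linger as an orphan */
        }
        /* otherwise buf holds normal traffic from the orted; handle it here */
    }
}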

@rhc54
Contributor

rhc54 commented Mar 2, 2016

I have verified this is working okay in master, but it was a problem in 2.x. The following PR fixes it:

open-mpi/ompi-release#997

@jsquyres
Member Author

jsquyres commented Mar 7, 2016

Found a case where #997 doesn't seem to fix the problem:

  • get a SLURM allocation
  • run a long job (e.g., ring of a really large message a large number of times; a minimal sketch of such a test is below)
  • ssh to one of the nodes in the SLURM allocation
  • killall -9 orted on that node
  • mpirun dies with an appropriate error message, and all the other orteds and MPI processes die, but the MPI processes on the same server as the kill -9'ed orted are still running
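
For the "long job" step, something along the lines of the following ring test will do (a minimal, hypothetical sketch, not the exact program used; it assumes at least two ranks):

/* Hypothetical ring test of the kind described above: each rank passes a
 * large message around the ring many times.  Illustrative only; assumes
 * at least two ranks. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 4 * 1024 * 1024;   /* a "really large" message (of ints) */
    const int iters = 100000;            /* "a large number of times" */
    int rank, size;
    int *buf = calloc(count, sizeof(int));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    for (int i = 0; i < iters; ++i) {
        if (0 == rank) {
            /* rank 0 starts the ring and waits for the message to come back */
            MPI_Send(buf, count, MPI_INT, next, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, count, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, count, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, count, MPI_INT, next, 0, MPI_COMM_WORLD);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}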

@rhc54 Is investigating.

@rhc54
Contributor

rhc54 commented Mar 7, 2016

Sadly, I am unable to replicate this behavior on either master or 2.x. I tried adding "sleep" instead of just having the proc spin, but it made no difference. I also tried inside and outside of a SLURM allocation - no difference. As soon as the daemon is killed, everything exits as it should.

@jsquyres
Member Author

jsquyres commented Mar 7, 2016

Huh. Let me try again and see if I can be a bit more precise about the failure case.

@jsquyres
Member Author

jsquyres commented Mar 7, 2016

In my case, the PMIx thread in the hung MPI processes seems to be stuck here:

(gdb) bt
#0  0x00000036594accdd in nanosleep () from /lib64/libc.so.6
#1  0x00000036594e1e54 in usleep () from /lib64/libc.so.6
#2  0x00002aaaac8e7951 in OPAL_PMIX_PMIX120_PMIx_Get (proc=0x2aaaacd18710, key=0x2aaaaabc9197 "pmix.loc", info=0x175a070, ninfo=1, val=0x2aaaacd18820) at src/client/pmix_client_get.c:92
#3  0x00002aaaac8bb14b in pmix120_get (proc=0x174bf78, key=0x2aaaaabc9197 "pmix.loc", info=0x2aaaacd188d0, val=0x2aaaacd18950) at pmix120_client.c:420
#4  0x00002aaaaab02e9f in ompi_proc_complete_init_single (proc=0x174bf30) at proc/proc.c:147
#5  0x00002aaaaab03698 in ompi_proc_for_name_nolock (proc_name=...) at proc/proc.c:224
#6  0x00002aaaaab0370e in ompi_proc_for_name (proc_name=...) at proc/proc.c:244
#7  0x00002aaaaab07d1b in ompi_group_dense_lookup (group=0x172ac10, peer_id=16, allocate=true) at ../ompi/group/group.h:356
#8  0x00002aaaaab07e36 in ompi_group_get_proc_ptr (group=0x172ac10, rank=16, allocate=true) at ../ompi/group/group.h:385
#9  0x00002aaaaab07fde in try_kill_peers (comm=0x601280 <ompi_mpi_comm_world>, errcode=-12) at runtime/ompi_mpi_abort.c:97
#10 0x00002aaaaab0842b in ompi_mpi_abort (comm=0x601280 <ompi_mpi_comm_world>, errcode=-12) at runtime/ompi_mpi_abort.c:207
#11 0x00002aaaaaaee81e in ompi_errhandler_callback (status=-12, procs=0x1759af8, info=0x1759b70, cbfunc=0x2aaaac8b6c2c <cleanup_cbfunc>, cbdata=0x1759ad0) at errhandler/errhandler.c:249
#12 0x00002aaaac8b712e in notify (status=PMIX_ERR_UNREACH, procs=0x0, nprocs=0, info=0x17598a0, ninfo=1) at pmix_pmix120.c:204
#13 0x00002aaaac8d635a in opal_pmix_pmix120_pmix_errhandler_invoke (status=PMIX_ERR_UNREACH, procs=0x0, nprocs=0, info=0x0, ninfo=0) at src/util/error.c:231
#14 0x00002aaaac8df829 in lost_connection (peer=0x2aaaacb17cc0 <opal_pmix_pmix120_pmix_pmix_client_globals>, err=PMIX_ERR_UNREACH) at src/usock/usock_sendrecv.c:75
#15 0x00002aaaac8e05b8 in opal_pmix_pmix120_pmix_usock_recv_handler (sd=16, flags=2, cbdata=0x2aaaacb17cc0 <opal_pmix_pmix120_pmix_pmix_client_globals>) at src/usock/usock_sendrecv.c:390
#16 0x00002aaaab1d4513 in event_persist_closure (ev=<optimized out>, base=0x67c3b0) at event.c:1321
#17 event_process_active_single_queue (activeq=0x67c100, base=0x67c3b0) at event.c:1365
#18 event_process_active (base=<optimized out>) at event.c:1440
#19 opal_libevent2022_event_base_loop (base=0x67c3b0, flags=1) at event.c:1644
#20 0x00002aaaac8dcede in progress_engine (obj=0x67c3b0) at src/util/progress_threads.c:49
#21 0x00000036598079d1 in start_thread () from /lib64/libpthread.so.0
#22 0x00000036594e8b6d in clone () from /lib64/libc.so.6
(gdb) 

That is, if I set a debugger breakpoint one line beyond the PMIx_Get at src/client/pmix_client_get.c:92, it never gets there; the blocking PMIx_Get never returns.
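
In other words, the backtrace looks like the lost-connection handler calling back into a blocking get that waits for a reply from the very server whose connection was just lost. Roughly this pattern (names and structure are illustrative only; this is not the actual PMIx source):

/* Hypothetical sketch of the hang pattern in the backtrace above: the
 * lost-connection handler ends up in a blocking get that spins waiting for a
 * reply from the server whose connection was just lost.  Names are
 * illustrative only; this is not the actual PMIx source. */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static volatile bool reply_arrived = false;   /* would be set by the server's reply */

/* blocking get: spins until the local daemon answers */
static int blocking_get(void)
{
    /* the request would be sent to the daemon here */
    while (!reply_arrived) {
        usleep(1000);          /* matches the nanosleep/usleep frames in the bt */
    }
    return 0;
}

/* invoked when the connection to the daemon is lost */
static void lost_connection_handler(void)
{
    /* the abort path needs peer info, which triggers a blocking get... */
    blocking_get();            /* ...but the daemon is gone, so no reply ever comes */
    fprintf(stderr, "never reached\n");
}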

@jsquyres
Member Author

jsquyres commented Mar 8, 2016

@rhc54 This same problem does not appear to be happening on the v2.x branch. When I killall -9 orted on a server, all the MPI processes suicide properly.

@jsquyres jsquyres modified the milestones: v2.1.0, v2.0.0 Mar 14, 2016
@jsquyres
Member Author

@rhc54 Since it's apparently not happening on the v2.x branch, I'm setting the milestone to v2.1 (on the assumption that there will be a re-sync of PMIx on master to v2.x after 2.0.x and before 2.1.x).

@jsquyres jsquyres changed the title v2.0.0: fix MPI process suicide code v2.1.0: fix MPI process suicide code Mar 14, 2016
@rhc54
Contributor

rhc54 commented Apr 23, 2016

@jsquyres Can you please try this now? I think it may be (hopefully) fixed.

@jsquyres
Member Author

Agreed; this appears fixed now. Thanks.
