v2.1.0: fix MPI process suicide code #1418

Closed
jsquyres opened this issue Mar 2, 2016 · 9 comments

Comments

@jsquyres
Member

jsquyres commented Mar 2, 2016

Per discussions in Dallas, the "suicide" code in MPI (ORTE) processes isn't currently working. I.e., if an MPI (ORTE) process loses connectivity to its local orted, it's not killing itself. This can lead to orphaned MPI processes.

Filing this ticket so that we can be sure it gets fixed for the v2.0.0 release.
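
For context, the intended "suicide" behavior boils down to a pattern like the one below: the process watches its connection to the local orted and exits as soon as that connection goes away. This is a minimal, hypothetical C sketch (assumed to run on a dedicated progress thread, with daemon_fd an already-connected socket to the local orted), not the actual ORTE/PMIx code:

/* Minimal, hypothetical sketch of the intended behavior (NOT the actual
 * ORTE/PMIx code): watch the socket to the local orted and exit as soon as
 * the connection drops, so the process cannot be left orphaned.
 * daemon_fd is assumed to be a connected socket to the local orted. */
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>

void watch_daemon(int daemon_fd)
{
    struct pollfd pfd = { .fd = daemon_fd, .events = POLLIN };
    char buf[4096];

    for (;;) {
        if (poll(&pfd, 1, -1) < 0) {
            continue;   /* interrupted; retry */
        }
        if ((pfd.revents & (POLLHUP | POLLERR)) ||
            ((pfd.revents & POLLIN) &&
             recv(daemon_fd, buf, sizeof(buf), 0) <= 0)) {
            fprintf(stderr, "lost connection to local orted; exiting\n");
            exit(1);    /* "suicide": do not linger as an orphan */
        }
        /* otherwise buf holds normal traffic from the orted; handle it here */
    }
}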

@rhc54
Contributor

rhc54 commented Mar 2, 2016

I have verified this is working okay in master, but it was a problem in 2.x. The following PR fixes it:

open-mpi/ompi-release#997

@jsquyres
Member Author

jsquyres commented Mar 7, 2016

Found a case where #997 doesn't seem to fix the problem:

  • get a SLURM allocation
  • run a long job (e.g., ring of a really large message a large number of times; a minimal sketch of such a test is below)
  • ssh to one of the nodes in the SLURM allocation
  • killall -9 orted on that node
  • mpirun dies with an appropriate error message, and all the other orteds and MPI processes die, but the MPI processes on the same server as the kill -9'ed orted are still running
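
For the "long job" step, something along the lines of the following ring test will do (a minimal, hypothetical sketch, not the exact program used; it assumes at least two ranks):

/* Hypothetical ring test of the kind described above: each rank passes a
 * large message around the ring many times.  Illustrative only; assumes
 * at least two ranks. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 4 * 1024 * 1024;   /* a "really large" message (of ints) */
    const int iters = 100000;            /* "a large number of times" */
    int rank, size;
    int *buf = calloc(count, sizeof(int));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    for (int i = 0; i < iters; ++i) {
        if (0 == rank) {
            /* rank 0 starts the ring and waits for the message to come back */
            MPI_Send(buf, count, MPI_INT, next, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, count, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, count, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, count, MPI_INT, next, 0, MPI_COMM_WORLD);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}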

@rhc54 Is investigating.

@rhc54
Contributor

rhc54 commented Mar 7, 2016

Sadly, I am unable to replicate this behavior on either master or 2.x. I tried adding "sleep" instead of just having the proc spin, but it made no difference. I also tried inside and outside of a SLURM allocation - no difference. As soon as the daemon is killed, everything exits as it should.

@jsquyres
Member Author

jsquyres commented Mar 7, 2016

Huh. Let me try again and see if I can be a bit more precise about the failure case.

@jsquyres
Member Author

jsquyres commented Mar 7, 2016

In my case, the PMIx thread in the hung MPI processes seems to be stuck here:

(gdb) bt
#0  0x00000036594accdd in nanosleep () from /lib64/libc.so.6
#1  0x00000036594e1e54 in usleep () from /lib64/libc.so.6
#2  0x00002aaaac8e7951 in OPAL_PMIX_PMIX120_PMIx_Get (proc=0x2aaaacd18710, key=0x2aaaaabc9197 "pmix.loc", info=0x175a070, ninfo=1, val=0x2aaaacd18820) at src/client/pmix_client_get.c:92
#3  0x00002aaaac8bb14b in pmix120_get (proc=0x174bf78, key=0x2aaaaabc9197 "pmix.loc", info=0x2aaaacd188d0, val=0x2aaaacd18950) at pmix120_client.c:420
#4  0x00002aaaaab02e9f in ompi_proc_complete_init_single (proc=0x174bf30) at proc/proc.c:147
#5  0x00002aaaaab03698 in ompi_proc_for_name_nolock (proc_name=...) at proc/proc.c:224
#6  0x00002aaaaab0370e in ompi_proc_for_name (proc_name=...) at proc/proc.c:244
#7  0x00002aaaaab07d1b in ompi_group_dense_lookup (group=0x172ac10, peer_id=16, allocate=true) at ../ompi/group/group.h:356
#8  0x00002aaaaab07e36 in ompi_group_get_proc_ptr (group=0x172ac10, rank=16, allocate=true) at ../ompi/group/group.h:385
#9  0x00002aaaaab07fde in try_kill_peers (comm=0x601280 <ompi_mpi_comm_world>, errcode=-12) at runtime/ompi_mpi_abort.c:97
#10 0x00002aaaaab0842b in ompi_mpi_abort (comm=0x601280 <ompi_mpi_comm_world>, errcode=-12) at runtime/ompi_mpi_abort.c:207
#11 0x00002aaaaaaee81e in ompi_errhandler_callback (status=-12, procs=0x1759af8, info=0x1759b70, cbfunc=0x2aaaac8b6c2c <cleanup_cbfunc>, cbdata=0x1759ad0) at errhandler/errhandler.c:249
#12 0x00002aaaac8b712e in notify (status=PMIX_ERR_UNREACH, procs=0x0, nprocs=0, info=0x17598a0, ninfo=1) at pmix_pmix120.c:204
#13 0x00002aaaac8d635a in opal_pmix_pmix120_pmix_errhandler_invoke (status=PMIX_ERR_UNREACH, procs=0x0, nprocs=0, info=0x0, ninfo=0) at src/util/error.c:231
#14 0x00002aaaac8df829 in lost_connection (peer=0x2aaaacb17cc0 <opal_pmix_pmix120_pmix_pmix_client_globals>, err=PMIX_ERR_UNREACH) at src/usock/usock_sendrecv.c:75
#15 0x00002aaaac8e05b8 in opal_pmix_pmix120_pmix_usock_recv_handler (sd=16, flags=2, cbdata=0x2aaaacb17cc0 <opal_pmix_pmix120_pmix_pmix_client_globals>) at src/usock/usock_sendrecv.c:390
#16 0x00002aaaab1d4513 in event_persist_closure (ev=<optimized out>, base=0x67c3b0) at event.c:1321
#17 event_process_active_single_queue (activeq=0x67c100, base=0x67c3b0) at event.c:1365
#18 event_process_active (base=<optimized out>) at event.c:1440
#19 opal_libevent2022_event_base_loop (base=0x67c3b0, flags=1) at event.c:1644
#20 0x00002aaaac8dcede in progress_engine (obj=0x67c3b0) at src/util/progress_threads.c:49
#21 0x00000036598079d1 in start_thread () from /lib64/libpthread.so.0
#22 0x00000036594e8b6d in clone () from /lib64/libc.so.6
(gdb) 

That is, if I set a debugger breakpoint one line beyond the PMIx_Get at src/client/pmix_client_get.c:92, it never gets there; the blocking PMIx_Get never returns.
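
In other words, the backtrace looks like the lost-connection handler calling back into a blocking get that waits for a reply from the very server whose connection was just lost. Roughly this pattern (names and structure are illustrative only; this is not the actual PMIx source):

/* Hypothetical sketch of the hang pattern in the backtrace above: the
 * lost-connection handler ends up in a blocking get that spins waiting for a
 * reply from the server whose connection was just lost.  Names are
 * illustrative only; this is not the actual PMIx source. */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static volatile bool reply_arrived = false;   /* would be set by the server's reply */

/* blocking get: spins until the local daemon answers */
static int blocking_get(void)
{
    /* the request would be sent to the daemon here */
    while (!reply_arrived) {
        usleep(1000);          /* matches the nanosleep/usleep frames in the bt */
    }
    return 0;
}

/* invoked when the connection to the daemon is lost */
static void lost_connection_handler(void)
{
    /* the abort path needs peer info, which triggers a blocking get... */
    blocking_get();            /* ...but the daemon is gone, so no reply ever comes */
    fprintf(stderr, "never reached\n");
}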

@jsquyres
Member Author

jsquyres commented Mar 8, 2016

@rhc54 This same problem does not appear to be happening on the v2.x branch. When I killall -9 orted on a server, all the MPI processes suicide properly.

@jsquyres jsquyres modified the milestones: v2.1.0, v2.0.0 Mar 14, 2016
@jsquyres
Member Author

@rhc54 Since it's apparently not happening on the v2.x branch, I'm setting the milestone to v2.1 (on the assumption that there will be a re-sync of PMIx on master to v2.x after 2.0.x and before 2.1.x).

@jsquyres jsquyres changed the title v2.0.0: fix MPI process suicide code v2.1.0: fix MPI process suicide code Mar 14, 2016
@rhc54
Contributor

rhc54 commented Apr 23, 2016

@jsquyres Can you please try this now? I think it may be (hopefully) fixed.

@jsquyres
Member Author

Agreed; this appears fixed now. Thanks.
