PMIX - SLURM OpenMPI #11471
Comments
I forgot to mention that the jobs indicated earlier are running in K8s pods/containers using a full Rocky 9.1 OS (kubectl get pod -n slurm).
Good day. I am not familiar with K8s containers, but I had this issue when the firewall was active on the compute nodes. When I disabled the firewall, MPI + Slurm worked fine.
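(As a hedged aside: on a Rocky-based node the firewall in question is typically firewalld, so the check/workaround might look like the sketch below. The port numbers are only illustrative assumptions, since the thread doesn't say which ports were blocked.)

# Check whether firewalld is active on a compute node, and temporarily disable it for testing
systemctl status firewalld
systemctl stop firewalld
# Rather than disabling it outright, one could instead open the slurmd port and a fixed
# srun port range (requires SrunPortRange in slurm.conf; 6818 and 60001-63000 are examples)
firewall-cmd --permanent --add-port=6818/tcp
firewall-cmd --permanent --add-port=60001-63000/tcp
firewall-cmd --reload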
I've gotten caught by this problem too, running Slurm 23.11.2 and PMIx 4.2.9; however, I am not running a firewall. The only thing I can add is that I also see this in the slurmctld log:
slurm_send_node_msg: [socket:[3652652]] slurm_bufs_sendto(msg_type=SRUN_JOB_COMPLETE) failed: Connection reset by peer
I logged a support issue with Slurm as well, but I'm not sure where the issue stems from. srun --mpi=pmix -n1 -N1 /bin/hostname works. It stands to reason that, since I'm working in a new environment, there's something in it causing this, but I'm not sure what it is. Any clues for debugging would be helpful. Thanks.
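(A minimal sketch of the kind of incremental narrowing described above, assuming a two-node test allocation; the extra verbosity flags and the log path are illustrative, not taken from the poster's setup.)

srun --mpi=pmix -n1 -N1 /bin/hostname                          # single node: works
srun --mpi=pmix -n2 -N2 -vvv /bin/hostname                     # two nodes, with extra srun verbosity
srun --mpi=pmix -n2 -N2 --slurmd-debug=verbose /bin/hostname   # copy slurmd debug output to stderr (may require admin rights)
grep -i pmix /var/log/slurm/slurmd.log                         # on the compute nodes; log path varies by site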
Just to clarify: you configured Slurm with PMIx v4.2.9 and it is failing, true? If so, the only thing I can suggest is asking them about it. I believe that error message indicates that a slurmd lost its connection back to srun. Your first use-case doesn't cause two slurmd's to communicate since you only run on one node. The second one does, and obviously the connection is broken upon job completion. The third one also creates the connection, but indicates that the connection is being broken by something to do with the PMIx plugin, which is a black box to me, I'm afraid 🤷♂️
It turns out that this line, yanked from the slurmd log on the compute nodes:
mpi/pmix_v4: pmixp_usock_create_srv: hostname [0]: pmixp_utils.c:105: Cannot bind() UNIX socket /slurm/current/spool/slurmd/stepd.slurm.pmix.29.0: Address already in use (98)
indicated that my clients were all using the same /var/spool/slurmd directory, which was true since it's an NFS mount. I had to change the SlurmdSpoolDir variable to add %n to the end so that each compute node has a different directory to work in from its neighbor. What's interesting, though, is that even though the job works now, I still get these in the logs:
srun: debug: mpi/pmix_v3: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:382: Abort agent port: 36317
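(For reference, a minimal sketch of the SlurmdSpoolDir change described above, assuming slurm.conf lives at /etc/slurm/slurm.conf; %n expands to the node name, so each slurmd gets its own spool directory even on a shared NFS mount.)

# /etc/slurm/slurm.conf (excerpt)
SlurmdSpoolDir=/var/spool/slurmd/%n

# After updating the (shared) slurm.conf, restart slurmd on each compute node, e.g.:
systemctl restart slurmd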
I'm suspicious of those debug messages as they are coming from the wrong Slurm PMIx plugin - your earlier note correctly showed you using the pmix_v4 plugin.
Yes, sorry, it's a copy error; they're pmix_v4.
Those might be legitimate debug messages, then - I don't see any indication of an error. |
Yes, sorry for the false alarm. Now that I re-read the message more closely, it's not actually an error. It looks like there's some abort daemon that PMIx is loading, and this is just a notice that it's starting up.
Dear Support,
I am using Open MPI version 4.1.5, installed using Spack:
[root@login-cluster-1 hello-world]# mpirun --version
mpirun (Open MPI) 4.1.5
If I run a simple hello world using mpirun, everything works fine; my problem is when I try using srun (Slurm scheduler with PMIx) on multiple nodes.
[root@login-cluster-1 hello-world]# mpirun --allow-run-as-root -n 16 --hostfile hostfile --prefix /scratch/hussein/spack/myenv /scratch/hussein/hello-world/hello_mpi
Hello from process 12 of 16
Hello from process 3 of 16
Hello from process 14 of 16
Hello from process 5 of 16
Hello from process 13 of 16
Hello from process 15 of 16
Hello from process 6 of 16
Hello from process 11 of 16
Hello from process 4 of 16
Hello from process 9 of 16
Hello from process 10 of 16
Hello from process 1 of 16
Hello from process 2 of 16
Hello from process 8 of 16
Hello from process 0 of 16
Hello from process 7 of 16
Running srun on a single node using PMIx:
[root@login-cluster-1 hello-world]# srun -N1 -n 16 --mpi=pmix /scratch/hussein/hello-world/hello_mpi
Hello from process 1 of 16
Hello from process 15 of 16
Hello from process 0 of 16
Hello from process 2 of 16
Hello from process 3 of 16
Hello from process 4 of 16
Hello from process 5 of 16
Hello from process 8 of 16
Hello from process 9 of 16
Hello from process 10 of 16
Hello from process 11 of 16
Hello from process 12 of 16
Hello from process 13 of 16
Hello from process 14 of 16
Hello from process 6 of 16
Hello from process 7 of 16
The supported PMIx plugin versions:
[root@login-cluster-1 hello-world]# srun --mpi=list
MPI plugin types are...
none
pmi2
pmix
specific pmix plugin versions available: pmix_v3
An example of running hello_mpi on two nodes, which fails:
[root@login-cluster-1 hello-world]# srun -vv -N2 -n 16 --mpi=pmix /scratch/hussein/hello-world/hello_mpi
srun: defined options
srun: -------------------- --------------------
srun: mpi : pmix
srun: nodes : 2
srun: ntasks : 16
srun: verbose : 2
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=8388608
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=18446744073709551615
srun: debug: propagating RLIMIT_NOFILE=1024
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 38699
srun: debug: Entering _msg_thr_internal
srun: debug: auth/munge: init: Munge authentication plugin loaded
srun: debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: Nodes compute001-cluster-1,compute002-cluster-1 are ready for job
srun: jobid 14: nodes(2):`compute001-cluster-1,compute002-cluster-1', cpu counts: 15(x1),1(x1)
srun: debug: requesting job 14, user 0, nodes 2 including ((null))
srun: debug: cpus 16, tasks 16, name hello_mpi, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi/pmix_v3: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:382: Abort agent port: 36317
srun: debug: mpi/pmix_v3: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:281: setup process mapping in srun
srun: debug: Entering _msg_thr_create()
srun: debug: mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:353: Start abort thread
srun: debug: initialized stdio listening socket, port 46811
srun: debug: Started IO server thread (140168583353920)
srun: debug: Entering _launch_tasks
srun: launching StepId=14.0 on host compute001-cluster-1, 15 tasks: [0-14]
srun: launching StepId=14.0 on host compute002-cluster-1, 1 tasks: 15
srun: route/default: init: route default plugin loaded
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: Complete StepId=14.0+0 received
srun: launch/slurm: launch_p_step_launch: StepId=14.0 aborted before step completely launched.
srun: error: task 15 launch failed: Unspecified error
srun: launch/slurm: _task_start: Node compute002-cluster-1, 1 tasks started
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Complete StepId=14.0+0 received
srun: launch/slurm: _task_start: Node compute001-cluster-1, 15 tasks started
slurmstepd: error: *** STEP 14.0 ON compute001-cluster-1 CANCELLED AT 2023-03-07T15:21:45 ***
srun: launch/slurm: _task_finish: Received task exit notification for 15 tasks of StepId=14.0 (status=0x0009).
srun: error: compute001-cluster-1: tasks 0-14: Killed
srun: launch/slurm: _step_signal: Terminating StepId=14.0
srun: debug: task 0 done
srun: debug: task 1 done
srun: debug: task 2 done
srun: debug: task 3 done
srun: debug: task 4 done
srun: debug: task 5 done
srun: debug: task 6 done
srun: debug: task 7 done
srun: debug: task 8 done
srun: debug: task 9 done
srun: debug: task 10 done
srun: debug: task 11 done
srun: debug: task 12 done
srun: debug: task 13 done
srun: debug: task 14 done
srun: debug: IO thread exiting
srun: debug: mpi/pmix_v3: _conn_readable: (null) [0]: pmixp_agent.c:105: false, shutdown
srun: debug: mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:355: Abort thread exit
srun: debug: Leaving _msg_thr_internal
Thanks for your support