PMIX - SLURM OpenMPI #11471

Open
husseinharake opened this issue Mar 7, 2023 · 9 comments

@husseinharake

Dear Support,

I am using Open MPI version 4.1.5, installed using Spack:

[root@login-cluster-1 hello-world]# mpirun --version
mpirun (Open MPI) 4.1.5

If I run a simple hello world using mpirun, everything works fine; my problem is that when I try using srun (Slurm scheduler with PMIx) on multiple nodes, the job fails.

[root@login-cluster-1 hello-world]# mpirun --allow-run-as-root -n 16 --hostfile hostfile --prefix /scratch/hussein/spack/myenv /scratch/hussein/hello-world/hello_mpi
Hello from process 12 of 16
Hello from process 3 of 16
Hello from process 14 of 16
Hello from process 5 of 16
Hello from process 13 of 16
Hello from process 15 of 16
Hello from process 6 of 16
Hello from process 11 of 16
Hello from process 4 of 16
Hello from process 9 of 16
Hello from process 10 of 16
Hello from process 1 of 16
Hello from process 2 of 16
Hello from process 8 of 16
Hello from process 0 of 16
Hello from process 7 of 16
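
The hello_mpi binary above is presumably just a standard MPI hello world; its source isn't shown in this issue, but a minimal sketch that would produce this output looks like:

/* hello_mpi.c - minimal MPI hello world (an assumed equivalent of the binary used above) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                  /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

compiled with something like: mpicc hello_mpi.c -o hello_mpi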

Running srun on a single node using PMIx works:

[root@login-cluster-1 hello-world]# srun -N1 -n 16 --mpi=pmix /scratch/hussein/hello-world/hello_mpi
Hello from process 1 of 16
Hello from process 15 of 16
Hello from process 0 of 16
Hello from process 2 of 16
Hello from process 3 of 16
Hello from process 4 of 16
Hello from process 5 of 16
Hello from process 8 of 16
Hello from process 9 of 16
Hello from process 10 of 16
Hello from process 11 of 16
Hello from process 12 of 16
Hello from process 13 of 16
Hello from process 14 of 16
Hello from process 6 of 16
Hello from process 7 of 16

The supported PMIx plugin versions:

[root@login-cluster-1 hello-world]# srun --mpi=list
MPI plugin types are...
none
pmi2
pmix
specific pmix plugin versions available: pmix_v3
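
(As a side note, the PMIx that the Open MPI build itself was compiled against can be checked on the Open MPI side as well; this check is not from the original report, just a generic diagnostic:

ompi_info | grep -i pmix

which lists the pmix MCA component(s) Open MPI will use, so they can be compared against the pmix_v3 plugin Slurm reports above.)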

An example of running hello_mpi on two nodes, which fails:

[root@login-cluster-1 hello-world]# srun -vv -N2 -n 16 --mpi=pmix /scratch/hussein/hello-world/hello_mpi
srun: defined options
srun: -------------------- --------------------
srun: mpi : pmix
srun: nodes : 2
srun: ntasks : 16
srun: verbose : 2
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=8388608
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=18446744073709551615
srun: debug: propagating RLIMIT_NOFILE=1024
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 38699
srun: debug: Entering _msg_thr_internal
srun: debug: auth/munge: init: Munge authentication plugin loaded
srun: debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: Nodes compute001-cluster-1,compute002-cluster-1 are ready for job
srun: jobid 14: nodes(2):`compute001-cluster-1,compute002-cluster-1', cpu counts: 15(x1),1(x1)
srun: debug: requesting job 14, user 0, nodes 2 including ((null))
srun: debug: cpus 16, tasks 16, name hello_mpi, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi/pmix_v3: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:382: Abort agent port: 36317
srun: debug: mpi/pmix_v3: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:281: setup process mapping in srun
srun: debug: Entering _msg_thr_create()
srun: debug: mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:353: Start abort thread
srun: debug: initialized stdio listening socket, port 46811
srun: debug: Started IO server thread (140168583353920)
srun: debug: Entering _launch_tasks
srun: launching StepId=14.0 on host compute001-cluster-1, 15 tasks: [0-14]
srun: launching StepId=14.0 on host compute002-cluster-1, 1 tasks: 15
srun: route/default: init: route default plugin loaded
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: Complete StepId=14.0+0 received
srun: launch/slurm: launch_p_step_launch: StepId=14.0 aborted before step completely launched.
srun: error: task 15 launch failed: Unspecified error
srun: launch/slurm: _task_start: Node compute002-cluster-1, 1 tasks started
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Complete StepId=14.0+0 received
srun: launch/slurm: _task_start: Node compute001-cluster-1, 15 tasks started
slurmstepd: error: *** STEP 14.0 ON compute001-cluster-1 CANCELLED AT 2023-03-07T15:21:45 ***
srun: launch/slurm: _task_finish: Received task exit notification for 15 tasks of StepId=14.0 (status=0x0009).
srun: error: compute001-cluster-1: tasks 0-14: Killed
srun: launch/slurm: _step_signal: Terminating StepId=14.0
srun: debug: task 0 done
srun: debug: task 1 done
srun: debug: task 2 done
srun: debug: task 3 done
srun: debug: task 4 done
srun: debug: task 5 done
srun: debug: task 6 done
srun: debug: task 7 done
srun: debug: task 8 done
srun: debug: task 9 done
srun: debug: task 10 done
srun: debug: task 11 done
srun: debug: task 12 done
srun: debug: task 13 done
srun: debug: task 14 done
srun: debug: IO thread exiting
srun: debug: mpi/pmix_v3: _conn_readable: (null) [0]: pmixp_agent.c:105: false, shutdown
srun: debug: mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:355: Abort thread exit
srun: debug: Leaving _msg_thr_internal

Thanks for your support

@jsquyres jsquyres added this to the v4.1.6 milestone Mar 7, 2023
@husseinharake
Author

I forgot to mention that the jobs indicated earlier are running on K8s pods/containers, using a full Rocky 9.1 OS.

kubectl get pod -n slurm
NAME                                    READY   STATUS    RESTARTS   AGE
accounting-cluster-1-79c7f7ff46-s9s9d   2/2     Running   0          19h
compute001-cluster-1-5c4cd6c86d-mgvtg   1/1     Running   0          19h
compute002-cluster-1-8854cd7c8-bj5kc    1/1     Running   0          19h
compute003-cluster-1-74dbf6f556-xmv66   1/1     Running   0          19h
login-cluster-1-c5d7478f8-xplzm         1/1     Running   0          19h
slurmctl-cluster-1-856f76d74f-hfgq9     1/1     Running   0          19h

@bwbarrett bwbarrett modified the milestones: v4.1.6, v4.1.7 Sep 30, 2023
@anfray-m

Good day,

I am not familiar with K8s containers, but I had this issue when the firewall was activated on the compute nodes. When I disabled the firewall, MPI + Slurm worked fine.
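
For anyone else on RHEL/Rocky-style compute nodes, assuming firewalld is the firewall in question, a quick way to test this theory (test only; for production, open the required ports instead) is:

# on each compute node
systemctl stop firewalld
systemctl disable firewalld

# or keep firewalld running and trust the cluster-internal interface
# (the interface name here is just an example):
firewall-cmd --permanent --zone=trusted --add-interface=eth0
firewall-cmd --reload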

@mdidomenico4

I've gotten caught by this problem too, running Slurm 23.11.2 and PMIx 4.2.9; however, I am not running a firewall.

The only thing I can add is that I also see this in the slurmctld log:

slurm_send_node_msg: [socket:[3652652]] slurm_bufs_sendto(msg_type=SRUN_JOB_COMPLETE) failed: Connection reset by peer

I logged a support issue with Slurm as well, but I'm not sure where the issue stems from.

srun --mpi=pmix -n1 -N1 /bin/hostname -- works
srun --mpi=pmix -n2 -N2 /bin/hostname -- does not work
srun --mpi=none -n2 -N2 /bin/hostname -- works

It stands to reason that since I'm working in a new environment there's something in it causing this, but I'm not sure what it is; any clues for debugging would be helpful.

thanks

@rhc54
Contributor

rhc54 commented Feb 29, 2024

Just to clarify: you configured Slurm with PMIx v4.2.9 and it is failing - true? If so, only thing I can suggest is asking them about it. I believe that error message indicates that a slurmd lost its connection back to srun - and since you are running just hostname, it can't have anything to do with OMPI's PMIx integration.

Your first use-case doesn't cause two slurmd's to communicate since you only run on one node. Second one does, and obviously the connection is broken upon job complete. Third one also creates the connection, but indicates that the connection is being broken by something to do with the PMIx plugin.

Which is a black box to me, I'm afraid 🤷‍♂️

@mdidomenico4

It turns out this was yanked from the slurmd process log on the compute nodes:

mpi/pmix_v4: pmixp_usock_create_srv: hostname [0]: pmixp_utils.c:105: Cannot bind() UNIX socket /slurm/current/spool/slurmd/stepd.slurm.pmix.29.0: Address already in use (98)
mpi/pmix_v4: pmixp_stepd_init: hostname [0]: pmixp_server.c:387: pmixp_usock_create_srv
mpi/pmix_v4: mpi_p_slurmstepd_prefork: (null) [0]: mpi_pmix.c:222: pmixp_stepd_init() failed
failed mpi_g_slurmstepd_prefork
job_manager: exiting abnormally: unspecified error
stepd_cleanup: done with step(rc[]:Unspecified error, cleanup_rc[]:Unspecified error)

This indicated that my clients were all using the same /var/spool/slurmd directory, which was true since it's an NFS mount.

I had to change the SlurmdSpoolDir variable to add %n to the end, so each compute node has a different directory to work in from its neighbor.
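
For reference, the change in slurm.conf was along these lines (the exact path is an example; %n is expanded by Slurm to the node name, so every node gets its own spool directory even on a shared NFS mount):

# slurm.conf, before
SlurmdSpoolDir=/var/spool/slurmd

# slurm.conf, after
SlurmdSpoolDir=/var/spool/slurmd/%n

followed by restarting slurmd on the compute nodes so the new spool directory is picked up.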

What's interesting, though, is that even though the job works now, I still get these in the logs:

srun: debug: mpi/pmix_v3: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:382: Abort agent port: 36317
srun: debug: mpi/pmix_v3: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:281: setup process mapping in srun
srun: debug: Entering _msg_thr_create()

@rhc54
Contributor

rhc54 commented Feb 29, 2024

I'm suspicious of those debug messages as they are coming from the wrong Slurm PMIx plugin - your earlier note correctly showed you using the pmix_v4 plugin. With PMIx v4.x, you should not be seeing pmix_v3 messages - unless there is a typo in the plugin somewhere.

@mdidomenico4

Yes, sorry, it's a copy error; they're pmix_v4.

@rhc54
Contributor

rhc54 commented Feb 29, 2024

Those might be legitimate debug messages, then - I don't see any indication of an error.

@mdidomenico4

Yes, sorry for the false alarm. Now that I re-read the message more closely, it's not actually an error. It looks like there's some abort agent that PMIx is loading, and this is just a notice that it's starting up.

@jsquyres jsquyres modified the milestones: v4.1.7, v4.1.8 Jan 23, 2025
@bwbarrett bwbarrett modified the milestones: v4.1.8, v4.1.9 Feb 5, 2025
@bwbarrett bwbarrett modified the milestones: v4.1.9, v4.1.10 Feb 24, 2025