(3.6.0 ‐ latest) Prolog hangs due to long GPU health check times on certain instance types
Clusters running AWS ParallelCluster 3.6.0 and later with GPU health checks enabled may experience delays and eventual "Prolog hung" errors on instance types such as p4d and p5. These instances have a complex GPU topology, which results in lengthy diagnostic checks during the GPU health check. The Prolog, which runs the check before job tasks are started, must complete on all allocated nodes; if one node's GPU health check takes too long, the entire job setup is delayed and Slurm reports the "Prolog hung" error:
slurmstepd: error: Prolog hung on node xxx
Testing with g4dn and g6 instance types shows that this issue occurs rarely, as these instances have simpler GPU configurations that do not require such long diagnostic times.
All ParallelCluster 3.6.0+ clusters using the Slurm scheduler on instance types such as p4d and p5 with GPU health checks enabled are affected. The issue may also occur on other instance types if the GPU health check takes too long. A minimal example of the affected configuration is shown below.
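For reference, this is the kind of cluster configuration in which the GPU health check runs as part of the Prolog; here it is enabled at the queue level, and the queue and compute resource names are illustrative placeholders:
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-queue                # placeholder queue name
      HealthChecks:
        Gpu:
          Enabled: true              # runs the GPU health check in the Prolog on each allocated node
      ComputeResources:
        - Name: p4d-24xl             # placeholder compute resource name
          InstanceType: p4d.24xlarge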
- Modify the Slurm configuration to change the Prolog behavior by adding the following parameter to /opt/slurm/slurm.conf on the Head Node:
PrologFlags=Alloc,NoHold
This configuration makes the Prolog scripts run during resource allocation rather than right before task launch, removing the hard wait for Prolog completion on all nodes and preventing the "Prolog hung" error.
- Restart the Slurm controller:
sudo systemctl restart slurmctld
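After the restart, you can confirm that the new flags are active with scontrol; the output should list Alloc and NoHold:
scontrol show config | grep -i prologflags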
Note that you can also apply this setting through CustomSlurmSettings in the cluster configuration and run pcluster update-cluster, as sketched below.
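A minimal sketch of that approach, assuming a cluster named my-cluster and a configuration file cluster-config.yaml (both names are placeholders):
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    CustomSlurmSettings:
      # ParallelCluster renders this entry into the generated slurm.conf
      - PrologFlags: "Alloc,NoHold"
Then apply the change:
pcluster update-cluster --cluster-name my-cluster --cluster-configuration cluster-config.yaml
Depending on the update policy of the changed settings, the update may require stopping the compute fleet first (pcluster update-compute-fleet --cluster-name my-cluster --status STOP_REQUESTED).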
Slurm PrologFlags Documentation: https://slurm.schedmd.com/slurm.conf.html#OPT_PrologFlags