
(3.6.0 ‐ latest) Prolog hangs due to long GPU health check times on certain instance types


The issue

AWS ParallelCluster 3.6.0 and later, when configured with GPU health checks, may experience job launch delays and eventual "Prolog hung" errors on instance types such as p4d and p5. These instances have a complex GPU topology, which results in lengthy diagnostics during the GPU health check. The Prolog, which runs the check before job tasks are started, must complete on all allocated nodes; if the GPU health check on even one node takes too long, the entire job setup is delayed and Slurm reports a "Prolog hung" error:

slurmstepd: error: Prolog hung on node xxx

Testing with g4dn and g6 instance types shows that this issue occurs rarely, as these instances have simpler GPU configurations that do not require such long diagnostic times.

Affected Versions

All ParallelCluster versions 3.6.0 and later that use the Slurm scheduler with GPU health checks enabled on instance types such as p4d and p5 are affected. The issue may also occur on other instance types if the GPU health check takes too long.

Mitigation

  1. Modify the Slurm configuration to change the Prolog behavior by adding the following parameter to /opt/slurm/slurm.conf on the Head Node:
    PrologFlags=Alloc,NoHold
    
    This makes the Prolog scripts run at resource allocation time rather than right before task launch, which removes the hard wait for Prolog completion on all nodes and prevents the "Prolog hung" error.
  2. Restart the Slurm controller on the Head Node (a combined shell sketch of both steps follows this list):
    sudo systemctl restart slurmctld
    

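For convenience, here is a minimal shell sketch of the two steps above, run on the Head Node. It assumes the slurm.conf path stated above and that PrologFlags is not already set in the file; adjust the path or merge the flags if your setup differs.

    # Append the flag to slurm.conf (path as stated above; adjust if your
    # slurm.conf lives elsewhere, and merge with any existing PrologFlags
    # line rather than duplicating it)
    echo 'PrologFlags=Alloc,NoHold' | sudo tee -a /opt/slurm/slurm.conf
    
    # Restart the Slurm controller so the change takes effect
    sudo systemctl restart slurmctld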
Note that you can also apply this Slurm setting via CustomSlurmSettings in the cluster configuration and then run pcluster update-cluster, as sketched below.
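As a minimal sketch of that approach (the surrounding configuration is abbreviated, and the cluster name and configuration file name in the command are placeholders), add the flag under Scheduling/SlurmSettings in the cluster configuration:

    # Excerpt of the cluster configuration; CustomSlurmSettings entries are
    # passed through to the Slurm configuration generated by ParallelCluster
    Scheduling:
      Scheduler: slurm
      SlurmSettings:
        CustomSlurmSettings:
          - PrologFlags: Alloc,NoHold

Then apply the change with pcluster update-cluster --cluster-name <cluster-name> --cluster-configuration <config-file.yaml>.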

Reference

Slurm PrologFlags Documentation: https://slurm.schedmd.com/slurm.conf.html#OPT_PrologFlags
