Training using DDP and SLURM #5800
Replies: 6 comments 2 replies
-
I'm facing a similar situation as well. It would be great if someone could help us.
-
Doesn't SLURM determine which devices you can use? As far as I know, they are assigned to your process, so if there is a way to configure that, it is probably through the SLURM run script.
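For what it's worth, a quick way to see what SLURM has actually handed to a given process is to inspect the environment it sets up. This is only a hedged inspection sketch; which variables are present depends on the cluster configuration and the launcher used:

```python
import os

# Minimal inspection sketch (assumes a SLURM job launched with srun and a GPU
# request such as --gres=gpu; the exact set of variables depends on the site).
for var in ("CUDA_VISIBLE_DEVICES",   # GPUs SLURM exposed to this process
            "SLURM_PROCID",           # global rank of this task
            "SLURM_LOCALID",          # rank of this task on its node
            "SLURM_NTASKS",           # total number of tasks in the job
            "SLURM_JOB_NODELIST"):    # nodes allocated to the job
    print(var, "=", os.environ.get(var))
```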
-
@awaelchli you are right, SLURM does the assignment without a problem. However, Lightning checks how many GPUs the node has. In my example, I can request 8 GPUs via SLURM (5 GPUs on node1 and 3 GPUs on node2, which is transparent to me as a user). In Lightning, I set Trainer(gpus=8) and it fails because it compares the number of requested GPUs against the number of GPUs available on the current node (e.g., 8 vs. 5 or 3, depending on the node). Is there another way to set up the Trainer for this case? In our SLURM setup, the pure PyTorch data-parallel solution works without this limitation. From the Lightning cluster documentation, it looks like the only supported setup is to have the same number of GPUs (8) available on each of the 4 nodes. What happens in this example if one node has only 7 available GPUs and another node has 9?
# train on 32 GPUs across 4 nodes
trainer = Trainer(gpus=8, num_nodes=4, accelerator='ddp')
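To make the failure mode concrete, the check that trips here is essentially a comparison between the requested device count and what the local node reports. This is only an illustrative sketch of the idea, not Lightning's actual validation code:

```python
import torch

def check_requested_gpus(requested: int) -> None:
    # Each process only sees its own node's devices, so on a 5-GPU node this
    # raises even though the SLURM job as a whole was granted 8 GPUs.
    available = torch.cuda.device_count()
    if requested > available:
        raise RuntimeError(
            f"Requested {requested} GPUs but this node only has {available}."
        )

# With Trainer(gpus=8), this kind of check fails on both the 5-GPU and the
# 3-GPU node of the example above.
check_requested_gpus(8)
```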
-
@awaelchli for context, this happens on a cluster running the Pyxis plugin for Enroot containers. All env vars required by PyTorch (MASTER_PORT, MASTER_ADDR, WORLD_SIZE, RANK) are set correctly, and LOCAL_RANK contains the device ID to be used by the current process. This works just fine:
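The original snippet is not reproduced in the thread; the following is only a hedged sketch of the kind of bare-PyTorch DDP setup that works under these assumptions, relying solely on the env vars listed above:

```python
# Sketch (not the poster's exact code): plain PyTorch DDP setup that only
# relies on MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK and LOCAL_RANK as set
# by the SLURM/Pyxis launcher.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(model: torch.nn.Module) -> DDP:
    # init_method="env://" reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK
    # from the environment.
    dist.init_process_group(backend="nccl", init_method="env://")

    # LOCAL_RANK selects the GPU for this process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    return DDP(model.cuda(local_rank), device_ids=[local_rank])
```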
-
Not to be rude or anything, but why was this moved to a discussion? I don't see what there is left to discuss. As laid out, we encountered an issue using Lightning on our cluster that is needlessly hard to work around and limits scheduling options, and we see no way to fix it on our end. Meanwhile, bare torch just works with minimal effort.
-
I'm not sure if there's a way to do this in Lightning. I've been trying to get around it using Ray's PyTorch Lightning accelerator: https://github.com/ray-project/ray_lightning
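Roughly, and depending on the ray_lightning version (the entry point has been renamed over time, e.g. RayPlugin / RayStrategy), usage looks like the sketch below. Treat the exact class and argument names as assumptions to verify against the project's README:

```python
import pytorch_lightning as pl
from ray_lightning import RayPlugin  # name/signature vary by version (assumption)

# Ray schedules the 8 workers wherever GPUs happen to be free, so a 5 + 3
# split across two nodes is acceptable; Lightning no longer needs the same
# GPU count on every node.
plugin = RayPlugin(num_workers=8, use_gpu=True)

trainer = pl.Trainer(plugins=[plugin])
# trainer.fit(model)  # model: your LightningModule
```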
-
❓ Questions and Help
What is your question?
The current scenario is two nodes with different numbers of free GPUs. For instance, node1 has 5 free GPUs and node2 has 3 free GPUs. I can request the 8 free GPUs via SLURM without caring about the number of nodes. Is there any way I can use PL to train on the 8 available GPUs in this context? I read the documentation, and it looks like one constraint is to always have the same number of free GPUs on each node.
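For concreteness, the documented multi-node pattern assumes a uniform layout, i.e. the same number of GPUs on every node, so the heterogeneous 5 + 3 allocation described above has no direct equivalent. A hedged sketch of the documented shape, using the same PL 1.x-style arguments that appear elsewhere in this thread:

```python
from pytorch_lightning import Trainer

# Documented, uniform layout: 2 nodes x 4 GPUs each = 8 GPUs total.
trainer = Trainer(gpus=4, num_nodes=2, accelerator='ddp')

# There is no direct way to express "8 GPUs total, split 5 + 3 across nodes":
# gpus is interpreted per node, so the 3-GPU node could never satisfy gpus=5.
```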