Training using DDP and SLURM #5800

rave78 · 2021-01-19T09:29:52Z

rave78
Jan 19, 2021

❓ Questions and Help

What is your question?

The current scenario is two nodes with different free GPUs. For instance, node1 has 5 free gpus and node2 has 3 free gpus. I can requested the 8 free gpus using slurm without care the number of nodes. Is there any way that I can use PL for using the 8 available gpus in this context?. I read the documentation and it looks that one constraint is to have always the same number of free gpus on each node.

karthi0804 · 2021-01-21T03:43:51Z

karthi0804
Jan 21, 2021

Even I m facing a similar situation. It would be great if someone can help us.

2 replies

rave78 Mar 25, 2021
Author

Hi @karthi0804 did you work out this issue?

karthi0804 Mar 25, 2021

No. I dint spend time further to fix it.

awaelchli · 2021-01-24T06:58:07Z

awaelchli
Jan 24, 2021

doesn't slurm determine which devices you can use? As far as I know, they are assigned to your process so if there is a way to configure that then it is probably through the slurm run script.
In Lightning, num_gpus=n, gpus=m is the only supported way to select gpu devices at the moment.

0 replies

rave78 · 2021-01-25T10:04:26Z

rave78
Jan 25, 2021
Author

@awaelchli you are right. SLURM does the assignment without a problem. However, Lightning check how many GPUs has the node. In my example, I can request 8 GPUs via SLURM (5 GPUs in node1 and 3 GPUs in node2 -- which it is transparent for me as a user). In Lightning, I set my Trainer(gpus=8) and it failed because compare the number of requested gpus and the number of available gpu on the node (e.g, compare 8 vs 5 or 3 depending on the node). Is there another way to set the trainer for this case?. In our SLURM setup, the pure PyTorch data-parallel solution works without this limitation.

In Lightning Documentation Cluster, it looks that the only available setup is to have available 8 GPUs per each node (4). What happened in this example if one node has only 7 available GPUs and another node has the 9 GPUs?

# train on 32 GPUs across 4 nodes
trainer = Trainer(gpus=8, num_nodes=4, accelerator='ddp')

0 replies

jfolz · 2021-01-26T09:39:44Z

jfolz
Jan 26, 2021

@awaelchli for context this happens on a cluster running the Pyxis plugin for Enroot containers. All env vars required by Pytorch (MASTER_PORT, MASTER_ADDR, WORLD_SIZE, RANK) are set correctly and LOCAL_RANK contains the device ID to be used by the current process. This works just fine:

torch.distributed.init_process_group("nccl")
torch.cuda.set_device(int(os.getenv("LOCAL_RANK")))

0 replies

jfolz · 2021-03-02T10:56:31Z

jfolz
Mar 2, 2021

Not to be rude or anything, but why was this moved to a discussion? I don't see what there is left to discuss.

As laid out we encountered an issue using lightning in our cluster that is needlessly hard to work around, limits scheduling options, and see no way to fix it on our end. Meanwhile bare torch just works with minimum effort.
@awaelchli mentioned that it should be possible for Slurm to assign GPUs to tasks. AFAIK that is what --gpu-bind=single does, but that option was only added in Slurm 20.11.0 and I don't think a lot of clusters run that version yet. Ours is based in deepops with Slurm 20.02.4, which I would consider pretty recent actually.
Maybe it is possible to enforce this in slurm.conf in earlier versions, but I haven't researched it since we don't consider this to be an option, as we want to support legacy DataParallel code.

0 replies

import-antigravity · 2021-03-26T20:55:02Z

import-antigravity
Mar 26, 2021

I'm not sure if there's a way to do this in Lightning. I've been trying to get around it using Ray.io's pytorch lightning accelerator: https://github.com/ray-project/ray_lightning

0 replies

Training using DDP and SLURM #5800

Uh oh!

rave78 Jan 19, 2021

❓ Questions and Help

What is your question?

Replies: 6 comments · 2 replies

Uh oh!

karthi0804 Jan 21, 2021

Uh oh!

rave78 Mar 25, 2021 Author

Uh oh!

karthi0804 Mar 25, 2021

Uh oh!

awaelchli Jan 24, 2021

Uh oh!

Uh oh!

rave78 Jan 25, 2021 Author

Uh oh!

Uh oh!

jfolz Jan 26, 2021

Uh oh!

Uh oh!

jfolz Mar 2, 2021

Uh oh!

import-antigravity Mar 26, 2021

rave78
Jan 19, 2021

Replies: 6 comments 2 replies

karthi0804
Jan 21, 2021

rave78 Mar 25, 2021
Author

awaelchli
Jan 24, 2021

rave78
Jan 25, 2021
Author

jfolz
Jan 26, 2021

jfolz
Mar 2, 2021

import-antigravity
Mar 26, 2021