FullyShardedDataParallel with 3D segmentation #1134

clarkbab · 2023-01-04T00:53:04Z

clarkbab
Jan 4, 2023

Hi there,

I originally posted this question in the PyTorch forums because it is related to use of the FullyShardedDataParallel, but I thought it would be worthwhile reposting here.

I'm trying to adapt the BRATS 3D segmentation tutorial to run using multiple GPUs to reduce the memory footprint and hence train using higher-resolution input images. Has anybody had any success with this?

Please see the original post for full details.

Thanks,
Brett

Nic-Ma · 2023-01-04T03:14:24Z

Nic-Ma
Jan 4, 2023
Maintainer

Hi @KumoLiu ,

Could you please help share some info for this question?

Thanks in advance.

0 replies

clarkbab · 2023-01-04T23:50:08Z

clarkbab
Jan 4, 2023
Author

Some more info on the problem for anyone interested...

I’m attempting to use FSDP for medical image segmentation to reduce GPU memory footprint during training. As a starting point, I’m trying to adapt the MONAI BRATS 3D segmentation tutorial to use FSDP with 2 GPUs.

I’ve created a fork of the original tutorial that instead spawns 2 processes, each of which begins a training loop with a module wrapped with FSDP. I had to split the fsdp_main function out into its own file due to this error.

from fsdp_main import fsdp_main
WORLD_SIZE = 2
mp.spawn(fsdp_main, args=(WORLD_SIZE,), nprocs=WORLD_SIZE, join=True)

I’m currently seeing this printed error message:

Expects tensor to be on the compute device cuda:1

And the following stack trace:

ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/baclark/venvs/medical-imaging/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/baclark/code/Project-MONAI-tutorials/3d_segmentation/fsdp_main.py", line 240, in fsdp_main
    outputs = model(inputs)
  File "/home/baclark/venvs/medical-imaging/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/baclark/venvs/medical-imaging/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2720, in forward
    self._pre_forward(self._handles, unshard_fn, unused, unused)
  File "/home/baclark/venvs/medical-imaging/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2756, in _pre_forward
    unshard_fn()
  File "/home/baclark/venvs/medical-imaging/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2768, in _pre_forward_unshard
    self._unshard(handles)
  File "/home/baclark/venvs/medical-imaging/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/baclark/venvs/medical-imaging/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1566, in _unshard
    handle.post_unshard()
  File "/home/baclark/venvs/medical-imaging/lib/python3.8/site-packages/torch/distributed/fsdp/flat_param.py", line 768, in post_unshard
    self._check_on_compute_device(self.flat_param)
  File "/home/baclark/venvs/medical-imaging/lib/python3.8/site-packages/torch/distributed/fsdp/flat_param.py", line 1064, in _check_on_compute_device
    p_assert(
  File "/home/baclark/venvs/medical-imaging/lib/python3.8/site-packages/torch/distributed/fsdp/_utils.py", line 149, in p_assert
    raise AssertionError
AssertionError

This seems to be occurring when the FSDP module unshards the parameters before performing the forward pass. For some reason this particular FlatParamHandle has been unsharded onto a different device than is stored in its self.device attribute.

I have checked that all input/label data and model parameters are on the correct devices before performing the forward pass.

The problem should be reproducible if you run the forked code with the following versions and update this directory to which the segmentation data is downloaded.

Any ideas why there might be a device mismatch for these flattened parameters?

CUDA                            11.7
python                          3.8.6
monai-weekly                    1.2.dev2252
torch                           1.13.1
torchaudio                      0.13.1
torchmetrics                    0.11.0
torchvision                     0.14.1

Thanks,
Brett

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

FullyShardedDataParallel with 3D segmentation #1134

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

FullyShardedDataParallel with 3D segmentation #1134

Uh oh!

clarkbab Jan 4, 2023

Replies: 2 comments

Uh oh!

Nic-Ma Jan 4, 2023 Maintainer

Uh oh!

Uh oh!

clarkbab Jan 4, 2023 Author

clarkbab
Jan 4, 2023

Nic-Ma
Jan 4, 2023
Maintainer

clarkbab
Jan 4, 2023
Author