group convolution error when using DDP training #1992

AnonymousAccount6688 · 2023-10-17T14:47:25Z

I tried to use a group (depthwise) convolution with the following line of code:

dw_conv = torch.nn.Conv2d(64, 64, 3, 1, 1, groups=64)

But I got the following error:

`
Using native Torch AMP. Training in mixed precision.
Using native Torch DistributedDataParallel.
Traceback (most recent call last):
  File "/afs/crc.nd.edu/user/y/ypeng4/Neighborhood-Attention-Transformer/classification/train_original.py", line 1024, in <module>
    main(args)
  File "/afs/crc.nd.edu/user/y/ypeng4/Neighborhood-Attention-Transformer/classification/train_original.py", line 514, in main
    model = NativeDDP(model, device_ids=[args.local_rank], broadcast_buffers=not args.no_ddp_bb)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python311/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python311/lib/python3.11/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: params[6] in this process with sizes [64, 1, 3, 3] appears not to match strides of the same param in process 0.
[The other ranks print the same traceback, interleaved with one another, each ending in the same RuntimeError.]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022927 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022928 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022929 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022930 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022931 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 5 (pid: 1022932) of binary: /scratch365/ypeng4/software/bin/anaconda/envs/python311/bin/python

`

When I changed it to

dw_conv = torch.nn.Conv2d(64, 64, 3, 1, 1, groups=1)

everything works fine.

Is there anything wrong with the DDP training of GroupConv?
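
For context, the sizes [64, 1, 3, 3] in the error message are simply the weight shape of the groups=64 layer. The sketch below assumes the layer is a 2D convolution with a square kernel; conv2d_weight_shape is a hypothetical helper written for illustration, not a PyTorch function.

```python
def conv2d_weight_shape(in_channels, out_channels, kernel_size, groups=1):
    """Weight shape of a 2D convolution: (out, in // groups, kH, kW).

    Hypothetical helper for illustration; assumes a square kernel.
    """
    if in_channels % groups or out_channels % groups:
        raise ValueError("in_channels and out_channels must be divisible by groups")
    return (out_channels, in_channels // groups, kernel_size, kernel_size)


# Depthwise case from the post: every filter sees a single input channel.
print(conv2d_weight_shape(64, 64, 3, groups=64))
# Ordinary convolution: every filter sees all 64 input channels.
print(conv2d_weight_shape(64, 64, 3, groups=1))
```

So params[6] is most likely the new dw_conv weight, and the complaint is about its strides, not its shape.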

rwightman (Maintainer) · 2023-10-17T18:19:26Z

@AnonymousAccount6688 Depthwise (DW) models work fine for me. The cause is probably a modification in the train script, or an added special case for rank 0, that is breaking things.
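
One way to look for rank-specific differences of this kind is to print every parameter's shape and strides on each rank just before the DDP wrap and compare the output. This is a minimal sketch, not part of timm or PyTorch: describe_params is a hypothetical helper that relies only on the .shape and .stride() members that PyTorch tensors provide. Its index lines up with the params[i] in the error only under the assumption that every parameter is trainable and none is excluded from DDP.

```python
def describe_params(named_params):
    """Return (index, name, shape, strides) per parameter, in registration order.

    `named_params` is any iterable of (name, tensor-like) pairs, for example
    the result of model.named_parameters().
    """
    rows = []
    for index, (name, param) in enumerate(named_params):
        rows.append((index, name, tuple(param.shape), tuple(param.stride())))
    return rows


# Hypothetical usage in the training script, right before NativeDDP(model, ...):
#   import torch.distributed as dist
#   for row in describe_params(model.named_parameters()):
#       print(f"rank {dist.get_rank()}: {row}", flush=True)
```

If rank 0 reports different strides from the other ranks for the same index, the code path that differs on rank 0 is the place to look.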

2 replies

AnonymousAccount6688 Oct 17, 2023
Author

Thank you for the reply.

I just tried to add a convolution after a Transformer block. I added one line, self.dw_conv = torch.nn.Conv2d(64, 64, 3, 1, 1, groups=1), in __init__, plus the call x = self.dw_conv(x), and everything works fine. Changing nothing except setting groups=64, I get the error above: params[6] in this process with sizes [64, 1, 3, 3] appears not to match strides of the same param in process 0.

May I know which part I should modify?
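
The thread does not establish why the strides differ between ranks. One property of the [64, 1, 3, 3] shape is worth knowing, though: a dimension of size 1 can carry any stride without changing where the elements live, so two ranks could hold identical memory layouts and still report different stride tuples. The sketch below illustrates this in plain Python using the usual row-major (NCHW) and channels-last (NHWC) stride formulas. Whether a channels-last conversion is involved in this training script is an assumption, and all function names are hypothetical.

```python
from itertools import product


def contiguous_strides(shape):
    """Row-major (NCHW) strides, in elements."""
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def channels_last_strides(shape):
    """Strides of a 4D tensor stored in NHWC order, reported in NCHW index order."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


def same_memory_layout(shape, strides_a, strides_b):
    """True if both stride tuples map every index to the same memory offset."""
    for index in product(*(range(size) for size in shape)):
        offset_a = sum(i * s for i, s in zip(index, strides_a))
        offset_b = sum(i * s for i, s in zip(index, strides_b))
        if offset_a != offset_b:
            return False
    return True


depthwise = (64, 1, 3, 3)   # weight shape with groups=64
regular = (64, 64, 3, 3)    # weight shape with groups=1

for shape in (depthwise, regular):
    a, b = contiguous_strides(shape), channels_last_strides(shape)
    print(shape, a, b, "tuples equal:", a == b,
          "same layout:", same_memory_layout(shape, a, b))
```

For the depthwise shape the two stride tuples differ only in the size-1 channel dimension and describe the same layout, which would explain why only the groups=64 layer could be affected by a strict tuple comparison if this is indeed what is happening.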

AnonymousAccount6688 Oct 19, 2023
Author

This seems to be a problem with PyTorch's NativeDDP:

from torch.nn.parallel import DistributedDataParallel as NativeDDP

When I used NVIDIA Apex DDP instead, everything worked fine.
