group convolution error when using DDP training #1992
Unanswered
AnonymousAccount6688
asked this question in
Contributing
Replies: 1 comment 2 replies
-
@AnonymousAccount6688 DW models work fine for me. The cause is probably a modification in the train script, or an added special case for rank 0, that is breaking things.
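One way to look for such a rank-specific difference is to dump every parameter's sizes and strides on each rank just before the model is wrapped in DDP, then compare the output for the index named in the error (`params[6]`). The following is a minimal sketch, not part of the training script: `param_layouts` and `demo` are hypothetical names, and it assumes the `params[i]` index in the error roughly follows the order of `model.named_parameters()`.

```python
import torch
import torch.nn as nn


def param_layouts(model: nn.Module):
    """Return (index, name, sizes, strides) for each parameter.

    The index follows named_parameters() order, which is assumed here to
    line up with the params[i] index in the DDP error message.
    """
    return [
        (i, name, tuple(p.shape), p.stride())
        for i, (name, p) in enumerate(model.named_parameters())
    ]


# Hypothetical stand-in; in the real script this would be `model`
# immediately before the NativeDDP(...) call.
demo = nn.Sequential(nn.Conv2d(64, 64, 3, 1, 1, groups=64))
layouts = param_layouts(demo)
for row in layouts:
    # In a real run, prefix each row with the rank and diff the ranks' output.
    print(row)
```

If one rank reports different strides for the same index, whatever runs on that rank between model creation and the DDP wrap (checkpoint loading, memory-format conversion, and so on) is a reasonable place to look next.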
-
I tried to use group convolution with the following line of code:
dw_conv = torch.nn.Conv2d(64, 64, 3, 1, 1, groups=64)
But got the following error:
`
Using native Torch AMP. Training in mixed precision.
Using native Torch DistributedDataParallel.
Traceback (most recent call last):
File "/afs/crc.nd.edu/user/y/ypeng4/Neighborhood-Attention-Transformer/classification/train_original.py", line 1024, in <module>
main(args)
File "/afs/crc.nd.edu/user/y/ypeng4/Neighborhood-Attention-Transformer/classification/train_original.py", line 514, in main
model = NativeDDP(model, device_ids=[args.local_rank], broadcast_buffers=not args.no_ddp_bb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch365/ypeng4/software/bin/anaconda/envs/python311/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/scratch365/ypeng4/software/bin/anaconda/envs/python311/lib/python3.11/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: params[6] in this process with sizes [64, 1, 3, 3] appears not to match strides of the same param in process 0.
[The remaining ranks print the same traceback, interleaved; each ends with the same RuntimeError about params[6] with sizes [64, 1, 3, 3].]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022927 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022928 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022929 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022930 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022931 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 5 (pid: 1022932) of binary: /scratch365/ypeng4/software/bin/anaconda/envs/python311/bin/python
`
When I change it to
dw_conv = torch.nn.Conv2d(64, 64, 3, 1, 1, groups=1)
everything works fine.
Is there anything wrong with the DDP training of GroupConv?
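As background for the error message, sizes alone do not determine strides when a dimension has size 1, which is the case for the depthwise weight `[64, 1, 3, 3]`. The sketch below is plain stride arithmetic for the standard row-major (NCHW) and channels-last (NHWC) layouts; the helper names are hypothetical, and whether a memory-format difference between ranks is what actually happens in this training script is an assumption, not something the traceback confirms.

```python
def contiguous_strides(n, c, h, w):
    """Strides (in elements) of a row-major NCHW tensor."""
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n, c, h, w):
    """Strides (in elements) when the same NCHW sizes are stored as NHWC."""
    return (h * w * c, 1, w * c, c)


depthwise = (64, 1, 3, 3)  # weight sizes from the error message (groups=64)
dense = (64, 64, 3, 3)     # weight sizes with groups=1

dw_layouts = (contiguous_strides(*depthwise), channels_last_strides(*depthwise))
dense_layouts = (contiguous_strides(*dense), channels_last_strides(*dense))
print(dw_layouts)     # ((9, 9, 3, 1), (9, 1, 3, 1))
print(dense_layouts)  # ((576, 9, 3, 1), (576, 1, 192, 64))
```

For the depthwise shape, the two stride tuples differ only in the size-1 dimension, whose index is always 0, so both describe the same bytes in the same order. A strict stride comparison across processes can therefore still report a mismatch even though the weights are laid out identically in memory.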