How are metrics aggregated in DDP? #2109
Replies: 10 comments
-
My guess was that only the train dataloader uses |
-
@alexeykarnachev mind having a look, please? ^^
-
I found this
This feels kind of hacky though. If there's an option like |
-
@ZhaofengWu could you please provide a minimal runnable script that reproduces the problem?
-
Sorry but it's not a problem/bug in the code. It's just a question: what's the proper way to aggregate metrics under DDP if we don't want the overhead of subclassing the |
-
I have the same problem |
-
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
-
My current workaround is to use You can also use the Actually, the lightning_template (https://github.com/PyTorchLightning/pytorch-lightning/blob/7cca3859a7b97a9ab4a6c6fb5f36ff94bff7f218/pl_examples/models/lightning_template.py) doesn't subclass Metric, which - if I understand correctly - means that it only logs metrics on rank = 0, same for the loss. |
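To make the rank-0 concern above concrete, here is a minimal sketch of reducing a metric across processes by hand with torch.distributed. The function name reduce_accuracy and the choice of accuracy as the metric are hypothetical, picked only for illustration. This is not necessarily how Lightning does it internally.

```python
import torch
import torch.distributed as dist


def reduce_accuracy(num_correct: int, num_total: int) -> float:
    """Combine per-process counts into one global accuracy (hypothetical helper)."""
    # Pack both counts into one tensor so a single collective call suffices.
    stats = torch.tensor([num_correct, num_total], dtype=torch.float64)
    # Only communicate when a process group has been initialized; otherwise
    # this degrades to a plain single-process computation.
    if dist.is_available() and dist.is_initialized():
        # Note: with the NCCL backend the tensor would have to be on the GPU.
        dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    correct, total = stats.tolist()
    return correct / total if total > 0 else 0.0
```

After the all_reduce every rank holds the same global counts, so logging from rank 0 alone would then report a value that covers all processes rather than only the local shard.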
-
@aaronma2020 mind providing a minimal runnable example?
-
I have the same issue, thoroughly explained in: #6501. |
-
I see that the LightningTemplateModel in 0.7.5 (no longer the case in master) manually averages the metrics in validation_epoch_end for DP and DDP2: https://github.com/PyTorchLightning/pytorch-lightning/blob/694f1d789dfa56b365b68dd4f3c6f5f7a4c8970a/pl_examples/models/lightning_template.py#L167-L168
But what about DDP? I get that each device can have its own loss for backward, but we want only one single metric across devices. How is that achieved? (Is averaging the best way to aggregate most metrics anyway?)
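On the parenthetical question of whether averaging is the right aggregation, a small pure-Python sketch may help. It assumes a ratio-style metric such as accuracy and made-up per-rank counts (the shards list is hypothetical). When ranks see different numbers of samples, averaging the per-rank values weights each rank equally, while summing the counts first weights each sample equally.

```python
def mean_of_means(per_rank):
    """Average the per-rank accuracies, weighting every rank equally."""
    rates = [correct / total for correct, total in per_rank]
    return sum(rates) / len(rates)


def pooled(per_rank):
    """Sum the counts across ranks first, then divide once."""
    correct = sum(c for c, _ in per_rank)
    total = sum(n for _, n in per_rank)
    return correct / total


# Hypothetical (correct, total) counts on two ranks with unequal shard sizes.
shards = [(9, 10), (1, 2)]

print(mean_of_means(shards))  # treats the 2-sample rank like the 10-sample one
print(pooled(shards))         # equals the accuracy over all 12 samples
```

The two results coincide when every rank processes the same number of samples, so plain averaging is usually fine in that case. For metrics that are not simple ratios, the analogous approach is to reduce whatever sufficient statistics the metric needs and compute the final value once.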