How are metrics aggregated in DDP? #2109

ZhaofengWu · 2020-06-08T05:24:44Z

I see that the LightningTemplateModel in 0.7.5 (no longer the case in master) manually averages the metrics in validation_epoch_end for DP and DDP2:
https://github.com/PyTorchLightning/pytorch-lightning/blob/694f1d789dfa56b365b68dd4f3c6f5f7a4c8970a/pl_examples/models/lightning_template.py#L167-L168

But what about DDP? I get that each device can have its own loss for backward, but we want only one single metric across devices. How is that achieved? (Is averaging the best way to aggregate most metrics anyway?)
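
On the last point (whether averaging is the right aggregation), here is a small plain-Python sketch, not Lightning code. The per-rank counts are made-up numbers, and it assumes ranks can end up with different numbers of examples (or tokens). It compares averaging the per-rank accuracies with summing the counts across ranks first:

```python
from typing import List, Tuple


def accuracy(correct: int, total: int) -> float:
    """Fraction of correct predictions."""
    return correct / total


# Hypothetical per-rank validation results: (num_correct, num_examples).
per_rank: List[Tuple[int, int]] = [(90, 100), (5, 10)]

# Option A: average the per-rank accuracies (what a plain mean-reduce gives).
mean_of_means = sum(accuracy(c, n) for c, n in per_rank) / len(per_rank)

# Option B: sum the raw counts across ranks, then compute the metric once.
total_correct = sum(c for c, _ in per_rank)
total_examples = sum(n for _, n in per_rank)
global_accuracy = accuracy(total_correct, total_examples)

print(f"mean of per-rank accuracies: {mean_of_means:.4f}")
print(f"accuracy from summed counts: {global_accuracy:.4f}")
```

The two numbers differ whenever the ranks see different amounts of data. Metrics that are not simple per-example means (F1, for instance) generally need the summed-counts approach regardless.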

ZhaofengWu (Author) · 2020-06-09T22:09:23Z

My guess was that only the train dataloader uses DistributedSampler, not val/test. In other words, each process evaluates the entire val/test set and only rank 0 reports (e.g. logs) the metrics. Apparently this used to be the case, but #1192 changed the val/test dataloaders to use DistributedSampler too. So I think some aggregation must be happening?
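
To make the sharding concrete, here is a simplified plain-Python model of how a DistributedSampler-style sampler splits an unshuffled dataset. The function name `shard_indices` is hypothetical, and it assumes no shuffling, no dropping of the last samples, and a dataset at least as large as the number of processes. It is a sketch of the idea, not the actual sampler code:

```python
import math
from typing import List


def shard_indices(dataset_len: int, num_replicas: int, rank: int) -> List[int]:
    """Simplified, unshuffled model of DistributedSampler-style sharding."""
    indices = list(range(dataset_len))
    per_rank = math.ceil(dataset_len / num_replicas)
    total_size = per_rank * num_replicas
    # Pad by wrapping around so that every rank gets the same number of samples.
    indices += indices[: total_size - dataset_len]
    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total_size:num_replicas]


# 10 validation samples split across 4 processes.
shards = [shard_indices(10, 4, rank) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Each process sees only its own shard, so a metric computed on one rank covers a fraction of the val/test set. When the dataset size is not divisible by the number of processes, the padded samples are evaluated twice, which can slightly skew an aggregated metric.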

Borda (Maintainer) · 2020-06-10T22:31:07Z

@alexeykarnachev mind having a look, please? ^^

ZhaofengWu (Author) · 2020-06-11T02:38:31Z

I found this:
https://github.com/PyTorchLightning/pytorch-lightning/blob/bd49b07fbba09b1e7d8851ee5a1ffce3d5925e9e/pytorch_lightning/metrics/metric.py#L46-L54
But if I don't want the overhead of creating a class for simple one-liner metrics, and/or have metrics that can't be easily reduced, is there a way to let the dev/test dataloaders load the entire datasets, as before #1192? The only way I can think of is to set replace_sampler_ddp=False and manually add the DistributedSampler to the training dataloader with something like:

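def load_dataset(self, mode, batch_size):
  ...
  if mode == 'train':
    # Temporarily re-enable sampler replacement so that only the training
    # dataloader gets a DistributedSampler; val/test keep the full dataset.
    self.trainer.replace_sampler_ddp = True
    dataloader = self.trainer.auto_add_sampler(dataloader, True)
    self.trainer.replace_sampler_ddp = False
  return dataloader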

This feels kind of hacky though. If there were an option like replace_evaluation_sampler_ddp, it would be much more straightforward.

alexeykarnachev · 2020-06-11T09:18:16Z

@ZhaofengWu could you please provide a minimal runnable script that reproduces the problem?

ZhaofengWu (Author) · 2020-06-11T18:23:49Z

Sorry, but it's not a problem/bug in the code. It's just a question: what's the proper way to aggregate metrics under DDP if we don't want the overhead of subclassing the TensorMetric mentioned above? If "letting dev/test dataloaders read the entire datasets" is the answer, what's the best way to do that?

aaronma2020 · 2020-06-20T12:24:34Z

I have the same problem.

stale[bot] (bot) · 2020-08-19T12:41:33Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

YassineYousfi · 2020-08-22T04:08:57Z

My current workaround is to use pl.metrics.converters._sync_ddp_if_available.

You can also use the pl.metrics.converters.sync_ddp decorator, but this means your metric will sync at each forward pass.

Actually, the lightning_template (https://github.com/PyTorchLightning/pytorch-lightning/blob/7cca3859a7b97a9ab4a6c6fb5f36ff94bff7f218/pl_examples/models/lightning_template.py) doesn't subclass Metric, which - if I understand correctly - means that it only logs metrics on rank = 0, same for the loss.
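
On the cost of syncing at each forward pass: for metrics built from sums (e.g. counts of correct predictions), accumulating locally and reducing once at the end of the epoch should give the same total with far fewer collectives. Below is a toy plain-Python simulation of that idea. `all_reduce_sum` is a hypothetical stand-in for a real sum all-reduce, the per-step counts are made up, and no actual process group is involved:

```python
from typing import List


def all_reduce_sum(values: List[int]) -> int:
    """Stand-in for a sum all-reduce: every rank would receive this total."""
    return sum(values)


# Hypothetical per-step correct counts, indexed as steps[rank][step].
steps = [[3, 4, 5], [2, 6, 1]]
num_steps = len(steps[0])

# Strategy 1: reduce at every step (one collective per forward pass).
per_step_total = 0
collectives_per_step_sync = 0
for step in range(num_steps):
    per_step_total += all_reduce_sum([rank_steps[step] for rank_steps in steps])
    collectives_per_step_sync += 1

# Strategy 2: accumulate locally on each rank, reduce once at epoch end.
local_totals = [sum(rank_steps) for rank_steps in steps]
epoch_end_total = all_reduce_sum(local_totals)
collectives_epoch_end_sync = 1

print(per_step_total, epoch_end_total)
print(collectives_per_step_sync, collectives_epoch_end_sync)
```

This equivalence only holds when the metric can be rebuilt from summed quantities. A metric that is a nonlinear function of the data would need its underlying counts synced rather than its per-step values.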

Borda (Maintainer) · 2020-09-15T18:45:46Z

@aaronma2020 mind providing a minimal running example?

jandonov · 2021-03-30T23:45:19Z

I have the same issue, thoroughly explained in: #6501.
