Releases: Lightning-AI/pytorch-lightning

PyTorch Lightning 1.6: Support Intel's Habana Accelerator, New efficient DDP strategy (Bagua), Manual Fault-tolerance, Stability and Reliability.

29 Mar 19:35

The core team is excited to announce the PyTorch Lightning 1.6 release ⚡

Highlights

PyTorch Lightning 1.6 is the work of 99 contributors who have worked on features, bug-fixes, and documentation for a total of over 750 commits since 1.5. This is our most active release yet. Here are some highlights:

Introducing Intel's Habana Accelerator

Lightning 1.6 now supports the Habana® framework, which includes Gaudi® AI training processors. Their heterogeneous architecture includes a cluster of fully programmable Tensor Processing Cores (TPC) and a configurable Matrix Math engine, along with the associated development tools and libraries.

You can leverage the Habana hardware to accelerate your Deep Learning training workloads simply by passing:

import pytorch_lightning as pl

trainer = pl.Trainer(accelerator="hpu")

# single Gaudi training
trainer = pl.Trainer(accelerator="hpu", devices=1)

# distributed training with 8 Gaudi
trainer = pl.Trainer(accelerator="hpu", devices=8)

The Bagua Strategy

Bagua is a deep learning training acceleration framework that supports multiple advanced distributed training algorithms with state-of-the-art system relaxation techniques. Enabling the new Bagua strategy, which can be considerably faster than vanilla PyTorch DDP, is as simple as:

from pytorch_lightning.strategies import BaguaStrategy

trainer = pl.Trainer(strategy="bagua")

# or to choose a custom algorithm
trainer = pl.Trainer(strategy=BaguaStrategy(algorithm="gradient_allreduce"))  # default

Towards stable Accelerator, Strategy, and Plugin APIs

The Accelerator, Strategy, and Plugin APIs are a core part of PyTorch Lightning. They're where all the distributed boilerplate lives, and we're constantly working to improve both them and the overall PyTorch Lightning platform experience.

In this release, we've made some large changes to achieve that goal. Not to worry, though! The only users affected by these changes are those who use custom implementations of Accelerator and Strategy (TrainingTypePlugin) as well as certain Plugins. In particular, we want to highlight the following changes:

  • All TrainingTypePlugins have been renamed to Strategy (#11120). Strategy is a more appropriate name because it encompasses more than simply training communication. It also aligns with the changes we introduced in 1.5, which added the new strategy and devices flags to the Trainer.

    # Before
    from pytorch_lightning.plugins import DDPPlugin
    
    # New
    from pytorch_lightning.strategies import DDPStrategy
  • The Accelerator and PrecisionPlugin have moved into the Strategy. All strategies now take optional accelerator and precision_plugin parameters (#11022, #10570).

  • Custom Accelerator implementations must now implement two new abstract methods: is_available() (#11797) and auto_device_count() (#10222). The latter determines how many devices get used by default when specifying Trainer(accelerator=..., devices="auto").

  • We redesigned the process creation for spawn-based strategies such as DDPSpawnStrategy and TPUSpawnStrategy (#10896). All spawn-based strategies now spawn processes immediately upon calling Trainer.{fit,validate,test,predict}, which means the hooks/callbacks prepare_data, setup, configure_sharded_model and teardown all run under an initialized process group. These changes align the spawn-based strategies with their non-spawn counterparts (such as DDPStrategy).
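
For illustration, here is a minimal sketch of this new behavior under ddp_spawn on CPU. The BoringModel class and random dataset below are our own illustrative stand-ins, not part of the release; the point is that setup() now runs inside the spawned workers, where the default process group is already initialized:

import torch
import torch.distributed as dist
import pytorch_lightning as pl


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def setup(self, stage=None):
        # with the 1.6 spawn redesign, this hook runs in the spawned worker
        # processes, so the default process group is already initialized
        if dist.is_available() and dist.is_initialized():
            print(f"setup() on rank {dist.get_rank()} of {dist.get_world_size()}")

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    data = torch.utils.data.DataLoader(torch.randn(64, 32), batch_size=8)
    trainer = pl.Trainer(accelerator="cpu", devices=2, strategy="ddp_spawn", max_epochs=1)
    trainer.fit(BoringModel(), train_dataloaders=data)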

We've also exposed the process group backend for configuration. For example, you can now easily enable fairring like this:

# Explicitly specify the process group backend if you choose to
ddp = pl.strategies.DDPStrategy(process_group_backend="fairring")
trainer = pl.Trainer(strategy=ddp, accelerator="gpu", devices=8)

Similarly, if you have torch>=1.11 installed, you can enable DDP static graph to apply special runtime optimizations:

trainer = pl.Trainer(devices=4, strategy=pl.strategies.DDPStrategy(static_graph=True))

LightningCLI improvements

In the previous release, we added shorthand notation support for registered components. In this release, we added a flag to automatically register all available components:

from pytorch_lightning.utilities.cli import LightningCLI

LightningCLI(auto_registry=True)

We have also added support for the ReduceLROnPlateau scheduler with shorthand notation:

$ python script.py fit --optimizer=Adam --lr_scheduler=ReduceLROnPlateau --lr_scheduler.monitor=metric_to_track

If you need to customize the learning rate scheduler configuration, you can do so by overriding:

class MyLightningCLI(LightningCLI):
    @staticmethod
    def configure_optimizers(lightning_module, optimizer, lr_scheduler=None):
        # return your custom learning rate scheduler configuration here
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": lr_scheduler}}

Finally, loggers are also now configurable with shorthand:

$ python script.py fit --trainer.logger=WandbLogger --trainer.logger.name="my_lightning_run"

Control SLURM's re-queueing

We've added the ability to turn the automatic resubmission on or off when a job gets interrupted by the SLURM controller (via signal handling). Users who prefer to let their code handle the resubmission (for example, when submitit is used) can now pass:

from pytorch_lightning.plugins.environments import SLURMEnvironment

trainer = pl.Trainer(plugins=SLURMEnvironment(auto_requeue=False))

Fault-tolerance improvements

Fault-tolerant training under manual optimization now tracks optimization progress. We also changed the graceful exit signal from SIGUSR1 to SIGTERM for better support inside cloud instances.
An additional feature we're excited to announce is support for consecutive trainer.fit() calls.

trainer = pl.Trainer(max_epochs=2)
trainer.fit(model)

# now, run 2 more epochs
trainer.fit_loop.max_epochs = 4
trainer.fit(model)

Loop customization improvements

The Loop's state is now included as part of the checkpoints saved by the library. This enables finer restoration of custom loops.
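
As a minimal sketch of what this enables (the my_counter attribute and loop subclass below are illustrative assumptions, not part of the release), a custom loop can expose its own state to the checkpoint roughly like this:

import pytorch_lightning as pl


class StatefulEpochLoop(pl.loops.TrainingEpochLoop):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.my_counter = 0  # custom state we want included in checkpoints

    def on_save_checkpoint(self):
        # returned dict is stored as part of the checkpoint saved by the Trainer
        return {"my_counter": self.my_counter}

    def on_load_checkpoint(self, state_dict):
        # restore the custom state when resuming from a checkpoint
        self.my_counter = state_dict["my_counter"]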

We've also made it easier to replace Lightning's loops with your own. For example:

class MyCustomLoop(pl.loops.TrainingEpochLoop):
    ...

trainer = pl.Trainer(...)
trainer.fit_loop.replace(epoch_loop=MyCustomLoop)
# Trainer runs the fit loop with your new epoch loop!
trainer.fit(model)

Data-Loading improvements

In previous versions, Lightning required that the DataLoader instance set its input arguments as instance attributes. This meant that custom DataLoaders also had this hidden requirement. In this release, we do this automatically for the user, making it easier to pass custom data loaders:

class MyDataLoader(torch.utils.data.DataLoader):
    def __init__(self, a=123, *args, **kwargs):
-       # this was required before
-       self.a = a
        super().__init__(*args, **kwargs)

trainer.fit(model, train_dataloaders=MyDataLoader())

As of this release, Lightning no longer pre-fetches 1 extra batch if it doesn't need to. Previously, doing so would conflict with the internal pre-fetching done by optimized data loaders such as FFCV's. You can now define your own pre-fetching value like this:

class MyCustomLoop(pl.loops.FitLoop):
    @property
    def prefetch_batches(self):
        return 7  # lucky number 7

trainer = pl.Trainer(...)
trainer.fit_loop = MyCustomLoop(min_epochs=trainer.min_epochs, max_epochs=trainer.max_epochs)

New Hooks

LightningModule.lr_scheduler_step

Lightning now allows the use of custom learning rate schedulers that aren't natively available in PyTorch. A great example of this is Timm Schedulers.

When using custom learning rate schedulers that rely on an API other than PyTorch's, you can now define LightningModule.lr_scheduler_step with your desired logic.

from timm.scheduler import TanhLRScheduler


class MyLightningModule(pl.LightningModule):
    def configure_optimizers(self): ...
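
The example above is truncated in the original notes. As a rough sketch of how the override might look (the Adam optimizer, t_initial value, and interval below are illustrative assumptions; only the lr_scheduler_step signature comes from the release), it could be completed along these lines:

import torch
import pytorch_lightning as pl
from timm.scheduler import TanhLRScheduler


class MyTimmModule(pl.LightningModule):
    # training_step, dataloaders, etc. omitted for brevity

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = TanhLRScheduler(optimizer, t_initial=10)  # illustrative value
        return [optimizer], [{"scheduler": scheduler, "interval": "epoch"}]

    def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
        # timm schedulers are stepped with the epoch index instead of a bare .step()
        scheduler.step(epoch=self.current_epoch)

Lightning calls lr_scheduler_step in place of its default scheduler.step() call, so custom scheduler APIs can be driven from here.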

Standard weekly patch release

09 Feb 20:42

[1.5.10] - 2022-02-08

Fixed

  • Fixed an issue where the validation loop would run on restart (#11552)
  • The Rich progress bar now correctly shows the on_epoch logged values on train epoch end (#11689)
  • Fixed the step argument in WandbLogger.log_image not working (#11716)
  • Fixed restore_optimizers for mapping states (#11757)
  • With DPStrategy, the batch is no longer explicitly moved to the device (#11780)
  • Fixed an issue where the validation progress bar would disappear after trainer.validate() (#11700)
  • Fixed supporting remote filesystems with Trainer.weights_save_path for fault-tolerant training (#11776)
  • Fixed check for available modules (#11526)
  • Fixed bug where the path for "last" checkpoints was not getting saved correctly which caused newer runs to not remove the previous "last" checkpoint (#11481)
  • Fixed bug where the path for best checkpoints was not getting saved correctly when no metric was monitored which caused newer runs to not use the best checkpoint (#11481)

Contributors

@ananthsub @Borda @circlecrystal @NathanGodey @nithinraok @rohitgr7

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

20 Jan 19:48

[1.5.9] - 2022-01-20

Fixed

  • Pinned sphinx-autodoc-typehints to <v1.15 (#11400)
  • Skipped testing with PyTorch 1.7 and Python 3.9 on Ubuntu (#11217)
  • Fixed type promotion when tensors of higher category than float are logged (#11401)
  • Fixed the format of the configuration saved automatically by the CLI's SaveConfigCallback (#11532)

Changed

  • Changed LSFEnvironment to use LSB_DJOB_RANKFILE environment variable instead of LSB_HOSTS for determining node rank and main address (#10825)
  • Disabled sampler replacement when using IterableDataset (#11507)

Contributors

@ajtritt @akihironitta @carmocca @rohitgr7

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

05 Jan 15:23

[1.5.8] - 2022-01-05

Fixed

  • Fixed LightningCLI race condition while saving the config (#11199)
  • Fixed the default value used with log(reduce_fx=min|max) (#11310)
  • Fixed data fetcher selection (#11294)
  • Fixed a race condition that could result in incorrect (zero) values being observed in prediction writer callbacks (#11288)
  • Fixed dataloaders not getting reloaded the correct number of times when setting reload_dataloaders_every_n_epochs and check_val_every_n_epoch (#10948)

Contributors

@adamviola @akihironitta @awaelchli @Borda @carmocca @edpizzi

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

21 Dec 18:33

[1.5.7] - 2021-12-21

Fixed

  • Fixed NeptuneLogger when using DDP (#11030)
  • Fixed a bug by disabling hyperparameter logging in the logger when there are no hparams (#11105)
  • Avoided the deprecated onnx.export(example_outputs=...) in torch 1.10 (#11116)
  • Fixed an issue when torch-scripting a LightningModule after training with Trainer(sync_batchnorm=True) (#11078)
  • Fixed an AttributeError occurring when using a CombinedLoader (multiple dataloaders) for prediction (#11111)
  • Fixed bug where Trainer(track_grad_norm=..., logger=False) would fail (#11114)
  • Fixed an incorrect warning being produced by the model summary when using bf16 precision on CPU (#11161)

Changed

  • DeepSpeed no longer requires ZeRO Stage 3 partitioning of the LightningModule (#10655)
  • The ModelCheckpoint callback now saves and restores attributes best_k_models, kth_best_model_path, kth_value, and last_model_path (#10995)

Contributors

@awaelchli @borchero @carmocca @guyang3532 @kaushikb11 @ORippler @Raalsky @rohitgr7 @SeanNaren

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

15 Dec 23:06

[1.5.6] - 2021-12-15

Fixed

  • Fixed a bug where the DeepSpeedPlugin arguments cpu_checkpointing and contiguous_memory_optimization were not being forwarded to deepspeed correctly (#10874)
  • Fixed an issue with NeptuneLogger causing checkpoints to be uploaded with a duplicated file extension (#11015)
  • Fixed support for logging within callbacks returned from LightningModule (#10991)
  • Fixed running sanity check with RichProgressBar (#10913)
  • Fixed support for CombinedLoader while checking for warning raised with eval dataloaders (#10994)
  • The TQDM progress bar now correctly shows the on_epoch logged values on train epoch end (#11069)
  • Fixed bug where the TQDM updated the training progress bar during trainer.validate (#11069)

Contributors

@carmocca @jona-0 @kaushikb11 @Raalsky @rohitgr7

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

07 Dec 15:35

[1.5.5] - 2021-12-07

Fixed

  • Disabled batch_size extraction for torchmetrics Metric instances because they accumulate the metrics internally (#10815)
  • Fixed an issue with SignalConnector not restoring the default signal handlers on teardown when running on SLURM or with fault-tolerant training enabled (#10611)
  • Fixed SignalConnector._has_already_handler check for callable type (#10483)
  • Fixed an issue to return the results for each dataloader separately instead of duplicating them for each (#10810)
  • Improved exception message if rich version is less than 10.2.2 (#10839)
  • Fixed uploading best model checkpoint in NeptuneLogger (#10369)
  • Fixed the early schedule reset logic in the PyTorch profiler that was causing a data leak (#10837)
  • Fixed a bug that caused incorrect batch indices to be passed to the BasePredictionWriter hooks when using a dataloader with num_workers > 0 (#10870)
  • Fixed an issue with item assignment on the logger on rank > 0 for loggers that support it (#10917)
  • Fixed importing torch_xla.debug for torch-xla<1.8 (#10836)
  • Fixed an issue with DDPSpawnPlugin and related plugins leaving a temporary checkpoint behind (#10934)
  • Fixed a TypeError occurring in the SignalConnector.teardown() method (#10961)

Contributors

@awaelchli @carmocca @four4fish @kaushikb11 @lucmos @mauvilsa @Raalsky @rohitgr7

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

30 Nov 14:41

[1.5.4] - 2021-11-30

Fixed

  • Fixed support for --key.help=class with the LightningCLI (#10767)
  • Fixed _compare_version for python packages (#10762)
  • Fixed the TensorBoardLogger SummaryWriter not being closed before spawning the processes (#10777)
  • Fixed a consolidation error in Lite when attempting to save the state dict of a sharded optimizer (#10746)
  • Fixed the default logging level for batch hooks associated with training from on_step=False, on_epoch=True to on_step=True, on_epoch=False (#10756)

Contributors

@awaelchli @carmocca @kaushikb11 @rohitgr7 @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

24 Nov 15:40

[1.5.3] - 2021-11-24

Fixed

  • Fixed ShardedTensor state dict hook registration to check if torch distributed is available (#10621)
  • Fixed an issue with self.log not respecting a tensor's dtype when applying computations (#10076)
  • Fixed LightningLite's _wrap_init popping non-existent keys from the DataLoader signature parameters (#10613)
  • Fixed signals being registered within threads (#10610)
  • Fixed an issue that caused Lightning to extract the batch size even though it was set by the user in LightningModule.log (#10408)
  • Fixed Trainer(move_metrics_to_cpu=True) not moving the evaluation logged results to CPU (#10631)
  • Fixed the {validation,test}_step outputs getting moved to CPU with Trainer(move_metrics_to_cpu=True) (#10631)
  • Fixed an issue with collecting logged test results with multiple dataloaders (#10522)

Contributors

@ananthsub @awaelchli @carmocca @jiwidi @kaushikb11 @qqueing @rohitgr7 @shabie @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

16 Nov 19:20

[1.5.2] - 2021-11-16

Fixed

  • Fixed CombinedLoader with max_size_cycle not receiving a DistributedSampler (#10374)
  • Fixed an issue where class or init-only variables of dataclasses were passed to the dataclass constructor in utilities.apply_to_collection (#9702)
  • Fixed isinstance not working with init_meta_context and the materialized model not being moved to the device (#10493)
  • Fixed an issue that prevented the Trainer from shutting down workers when execution is interrupted due to a failure (#10463)
  • Squeezed the early stopping monitor to remove empty tensor dimensions (#10461)
  • Fixed the sampler replacement logic with overfit_batches to only replace the sampler when SequentialSampler is not used (#10486)
  • Fixed scripting causing false positive deprecation warnings (#10470, #10555)
  • Do not fail if batch size could not be inferred for logging when using DeepSpeed (#10438)
  • Fixed propagation of device and dtype information to submodules of LightningLite when they inherit from DeviceDtypeModuleMixin (#10559)

Contributors

@a-gardner1 @awaelchli @carmocca @justusschock @Raahul-Singh @rohitgr7 @SeanNaren @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]