
Releases: Lightning-AI/pytorch-lightning

Weekly bugfix release

09 Aug 13:36

[0.5.5] - 2022-08-09

Deprecated

  • Deprecate sheety API (#14004)

Fixed

  • Resolved a bug where the work statuses would grow quickly and be duplicated (#13970)
  • Resolved a race condition when sending the work state through the caller_queue (#14074)
  • Fixed starting a Lightning App on the cloud when the repository name begins with "Lightning" (#14025)

Contributors

@manskx, @rlizzo, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

PyTorch Lightning 1.7: Apple Silicon support, Native FSDP, Collaborative training, and multi-GPU support with Jupyter notebooks

02 Aug 16:21
d2c086b

The core team is excited to announce the release of PyTorch Lightning 1.7 ⚡

PyTorch Lightning 1.7 is the culmination of work from 106 contributors who have worked on features, bug-fixes, and documentation for a total of over 492 commits since 1.6.0.

Highlights

Apple Silicon Support

For those using PyTorch 1.12 on M1 or M2 Apple machines, we have created the MPSAccelerator. MPSAccelerator enables accelerated GPU training using Apple’s Metal Performance Shaders (MPS) as the backend.


NOTE

Support for this accelerator is currently marked as experimental in PyTorch. Because many operators are still missing, you may run into a few rough edges.


# Selects the accelerator
trainer = pl.Trainer(accelerator="mps")

# Equivalent to
from pytorch_lightning.accelerators import MPSAccelerator
trainer = pl.Trainer(accelerator=MPSAccelerator())

# Defaults to "mps" when run on M1 or M2 Apple machines
# to avoid code changes when switching computers
trainer = pl.Trainer(accelerator="gpu")

Native Fully Sharded Data Parallel Strategy

PyTorch 1.12 also added native support for Fully Sharded Data Parallel (FSDP). Previously, PyTorch Lightning enabled this through the fairscale project. You can now choose between the two implementations.


NOTE

Support for this strategy is marked as beta in PyTorch.


# Native PyTorch implementation
trainer = pl.Trainer(strategy="fsdp_native")

# Equivalent to
from pytorch_lightning.strategies import DDPFullyShardedNativeStrategy
trainer = pl.Trainer(strategy=DDPFullyShardedNativeStrategy())

# For reference, FairScale's implementation can be used with
trainer = pl.Trainer(strategy="fsdp")

A Collaborative Training strategy using Hivemind

Collaborative Training removes the need for top-tier multi-GPU servers by allowing you to train across unreliable machines, such as local machines or even preemptible cloud compute, across the Internet.

Under the hood, we use Hivemind, which provides decentralized training across the Internet.

from pytorch_lightning.strategies import HivemindStrategy

trainer = pl.Trainer(
    strategy=HivemindStrategy(target_batch_size=8192), 
    accelerator="gpu", 
    devices=1
)

For more information, check out the docs.

Distributed support in Jupyter Notebooks

So far, the only multi-GPU strategy supported in Jupyter notebooks (including Grid.ai, Google Colab, and Kaggle, for example) has been the Data-Parallel (DP) strategy (strategy="dp"). DP, however, has several limitations that often obstruct users' workflows: it can be slow, it's incompatible with TorchMetrics, it doesn't persist state changes on replicas, and it's difficult to use with non-primitive input and output structures.

In this release, we've added support for Distributed Data Parallel in Jupyter notebooks using the fork mechanism to address these shortcomings. This is only available on macOS and Linux (sorry, Windows!).


NOTE

This feature is experimental.


This is how you use multi-device in notebooks now:

# Train on 2 GPUs in a Jupyter notebook
trainer = pl.Trainer(accelerator="gpu", devices=2)

# Can be set explicitly
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp_notebook")

# Can also be used in non-interactive environments
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp_fork")

By default, the Trainer detects the interactive environment and selects the right strategy for you. Learn more in the full documentation.

Versioning of "last" checkpoints

If a run is configured to save to the same directory as a previous run and ModelCheckpoint(save_last=True) is enabled, the "last" checkpoint is now versioned with a simple -v1 suffix to avoid overwriting the existing "last" checkpoint. This mimics the behaviour for checkpoints that monitor a metric.
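
As a minimal sketch of the resulting behaviour, assuming two runs that save to the same directory (the "checkpoints/" path and the model are illustrative placeholders):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# A minimal sketch: "checkpoints/" and `model` are placeholders.
checkpoint_callback = ModelCheckpoint(dirpath="checkpoints/", save_last=True)
trainer = pl.Trainer(callbacks=[checkpoint_callback])
trainer.fit(model)

# First run writes:                          checkpoints/last.ckpt
# A second run pointed at the same directory writes: checkpoints/last-v1.ckpt
# instead of overwriting the existing "last" checkpoint.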

Automatically reload the "last" checkpoint

In certain scenarios, like when running in a cloud spot instance with fault-tolerant training enabled, it is useful to load the latest available checkpoint. It is now possible to pass the string ckpt_path="last" in order to load the latest available checkpoint from the set of existing checkpoints.

trainer = Trainer(...)
trainer.fit(..., ckpt_path="last")

Validation every N batches across epochs

In some cases, for example iteration-based training, it is useful to run validation after every N training batches without being limited by the epoch boundary. Now, you can enable validation based on the total number of training batches.

trainer = Trainer(..., val_check_interval=N, check_val_every_n_epoch=None)
trainer.fit(...)

For example, given 5 epochs of 10 batches, setting N=25 would run validation in the 3rd and 5th epoch.

CPU stats monitoring

PyTorch Lightning provides the DeviceStatsMonitor callback to monitor the stats of the hardware currently used. However, users often also want to monitor the stats of other hardware. In this release, we have added an option to additionally monitor CPU stats:

from pytorch_lightning.callbacks import DeviceStatsMonitor

# Log both CPU stats and GPU stats
trainer = pl.Trainer(callbacks=DeviceStatsMonitor(cpu_stats=True), accelerator="gpu")

# Log just the GPU stats
trainer = pl.Trainer(callbacks=DeviceStatsMonitor(cpu_stats=False), accelerator="gpu")

# Equivalent to `DeviceStatsMonitor()`
trainer = pl.Trainer(callbacks=DeviceStatsMonitor(cpu_stats=True), accelerator="cpu")

The CPU stats are gathered using the psutil package.

Automatic distributed samplers

It is now possible to use custom samplers in a distributed environment without the need to set replace_sampler_ddp=False and wrap your sampler manually with DistributedSampler.
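
As a minimal sketch of what this enables (the ReversedSampler class, dataset, and model below are hypothetical placeholders):

import pytorch_lightning as pl
from torch.utils.data import DataLoader, Sampler

class ReversedSampler(Sampler):
    """Hypothetical custom sampler, used only for illustration."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)

# `dataset` and `model` are assumed to exist.
dataloader = DataLoader(dataset, sampler=ReversedSampler(dataset))

# No replace_sampler_ddp=False and no manual DistributedSampler wrapping required.
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
trainer.fit(model, dataloader)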

Inference mode support

PyTorch 1.9 introduced torch.inference_mode, a faster alternative to torch.no_grad. Lightning now uses inference_mode wherever possible during evaluation.
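
For reference, this is the underlying PyTorch primitive; the snippet below is a plain PyTorch sketch, independent of Lightning:

import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(1, 4)

# Like torch.no_grad, inference_mode disables gradient tracking, but it also
# skips autograd bookkeeping (e.g. version counters), making evaluation faster.
with torch.inference_mode():
    y = model(x)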

Support for warn-level determinism

In PyTorch 1.11, operations that do not have a deterministic implementation can be configured to raise a warning instead of an error when run in deterministic mode. This is now supported by our Trainer:

trainer = pl.Trainer(deterministic="warn")

LightningCLI improvements

After the latest updates to jsonargparse, the library that powers the LightningCLI, there is now complete support for shorthand notation. This includes automatic shorthand notation for all arguments, not just the ones that are part of the registries, plus support inside configuration files.

+ # pytorch_lightning==1.7.0
  trainer:
  callbacks:
-   - class_path: pytorch_lightning.callbacks.EarlyStopping
+   - class_path: EarlyStopping
      init_args:
        monitor: "loss"

A header with the version that generated the config is now included.

All subclasses for a given base class can be specified by name, so there's no need to explicitly register them. The only requirement is that the module where the subclass is defined is imported prior to parsing.

from pytorch_lightning.cli import LightningCLI
import my_code.models
import my_code.optimizers

cli = LightningCLI()
# Now use any of the classes:
# python trainer.py fit --model=Model1 --optimizer=CustomOptimizer

The new version renders the registries and the auto_registry flag, introduced in 1.6.0, unnecessary, so we have deprecated them.

Support was also added for list appending; for example, to add a callback to an existing list that might already be configured:

$ python trainer.py fit \
-   --trainer.callbacks=EarlyStopping \
+   --trainer.callbacks+=EarlyStopping \
    --trainer.callbacks.patience=5 \
-   --trainer.callbacks=LearningRateMonitor \
+   --trainer.callbacks+=LearningRateMonitor \
    --trainer.callbacks.logging_interval=epoch

Callback registration through entry points

Entry Points are an advanced feature in Python's setuptools that allow packages to expose metadata to other packages. In Lightning, we ...
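
As general background on the mechanism itself (not specific to how Lightning consumes it), an entry point is declared in a package's setup metadata; the group name below is purely hypothetical:

# setup.py of a hypothetical third-party package.
# "example.callbacks_factory" is an invented group name for illustration only;
# see the full release notes for the entry-point group Lightning actually reads.
from setuptools import setup

setup(
    name="my-lightning-extension",
    entry_points={
        "example.callbacks_factory": [
            "my_callbacks = my_package.factories:build_callbacks",
        ],
    },
)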

Read more

Built-in templates

01 Aug 14:39
99fce3b

[0.5.4] - 2022-08-01

Changed

  • Wrapped imports for traceability (#13924)
  • Set version as today (#13906)

Fixed

  • Included app templates in the lightning and app packages (#13731)
  • Added UI for installing it all (#13732)
  • Fixed build meta pkg flow (#13926)

Contributors

@Borda, @manskx

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Minor bug-fix release

26 Jul 17:11

[0.5.3] - 2022-07-25

Changed

  • Pruned duplicated requirements (#13739)

Fixed

  • Use correct python version in lightning component template (#13790)

Lightning App 0.5.2

18 Jul 16:36

[0.5.2] - 2022-07-18

Added

  • Updated the Lightning App docs (#13537)

Changed

  • Added LIGHTNING_ prefix to Platform AWS credentials (#13703)

PyTorch Lightning 1.6.5: Standard patch release

13 Jul 00:26
ff53616

[1.6.5] - 2022-07-13

Fixed

  • Fixed estimated_stepping_batches requiring distributed comms in configure_optimizers for the DeepSpeedStrategy (#13350)
  • Fixed bug with Python version check that prevented use with development versions of Python (#13420)
  • The loops now call .set_epoch() also on batch samplers if the dataloader has one wrapped in a distributed sampler (#13396)
  • Fixed the restoration of log step during restart (#13467)

Contributors

@adamjstewart @akihironitta @awaelchli @Borda @martinosorb @rohitgr7 @SeanNaren

PyTorch Lightning 1.6.4: Standard patch release

01 Jun 14:32
74b1317

[1.6.4] - 2022-06-01

Added

  • Exposed all DDP parameters through the HPU parallel strategy (#13067)

Changed

  • Keep torch.backends.cudnn.benchmark=False by default (unlike in v1.6.{0-4}) after speed and memory problems depending on the data used. Please consider tuning Trainer(benchmark) manually. (#13154)
  • Prevent modification of torch.backends.cudnn.benchmark when Trainer(benchmark=...) is not set (#13154)

Fixed

  • Fixed an issue causing zero-division error for empty dataloaders (#12885)
  • Fixed mismatching default values for the types of some arguments in the DeepSpeed and Fully-Sharded strategies which made the CLI unable to use them (#12989)
  • Avoid redundant callback restore warning while tuning (#13026)
  • Fixed Trainer(precision=64) during evaluation which now uses the wrapped precision module (#12983)
  • Fixed an issue to use wrapped LightningModule for evaluation during trainer.fit for BaguaStrategy (#12983)
  • Fixed an issue wrt unnecessary usage of habana mixed precision package for fp32 types (#13028)
  • Fixed the number of references of LightningModule so it can be deleted (#12897)
  • Fixed materialize_module setting a module's child recursively (#12870)
  • Fixed issue where the CLI could not pass a Profiler to the Trainer (#13084)
  • Fixed torchelastic detection with non-distributed installations (#13142)
  • Fixed logging's step values when multiple dataloaders are used during evaluation (#12184)
  • Fixed epoch logging on train epoch end (#13025)
  • Fixed DDPStrategy and DDPSpawnStrategy to initialize optimizers only after moving the module to the device (#11952)

Contributors

@akihironitta @ananthsub @ar90n @awaelchli @Borda @carmocca @dependabot @jerome-habana @mads-oestergaard @otaj @rohitgr7

PyTorch Lightning 1.6.3: Standard patch release

03 May 20:36

[1.6.3] - 2022-05-03

Fixed

  • Use only a single instance of rich.console.Console throughout codebase (#12886)
  • Fixed an issue to ensure all the checkpoint states are saved in a common filepath with DeepspeedStrategy (#12887)
  • Fixed trainer.logger deprecation message (#12671)
  • Fixed an issue where sharded grad scaler is passed in when using BF16 with the ShardedStrategy (#12915)
  • Fixed an issue wrt recursive invocation of DDP configuration in hpu parallel plugin (#12912)
  • Fixed printing of ragged dictionaries in Trainer.validate and Trainer.test (#12857)
  • Fixed threading support for legacy loading of checkpoints (#12814)
  • Fixed pickling of KFoldLoop (#12441)
  • Stopped optimizer_zero_grad from being called after IPU execution (#12913)
  • Fixed fuse_modules to be qat-aware for torch>=1.11 (#12891)
  • Enforced eval shuffle warning only for default samplers in DataLoader (#12653)
  • Enable mixed precision in DDPFullyShardedStrategy when precision=16 (#12965)
  • Fixed TQDMProgressBar reset and update to show correct time estimation (#12889)
  • Fixed fit loop restart logic to enable resume using the checkpoint (#12821)

Contributors

@akihironitta @carmocca @hmellor @jerome-habana @kaushikb11 @krshrimali @mauvilsa @niberger @ORippler @otaj @rohitgr7 @SeanNaren

PyTorch Lightning 1.6.2: Standard patch release

27 Apr 17:04

[1.6.2] - 2022-04-27

Fixed

  • Fixed ImportError when torch.distributed is not available. (#12794)
  • When using custom DataLoaders in LightningDataModule, multiple inheritance is resolved properly (#12716)
  • Fixed encoding issues on terminals that do not support unicode characters (#12828)
  • Fixed support for ModelCheckpoint monitors with dots (#12783)

Contributors

@akihironitta @alvitawa @awaelchli @Borda @carmocca @code-review-doctor @ethanfurman @HenryLau0220 @krshrimali @otaj

PyTorch Lightning 1.6.1: Standard weekly patch release

13 Apr 18:30

[1.6.1] - 2022-04-13

Changed

  • Support strategy argument being case insensitive (#12528)

Fixed

  • Run main progress bar updates independent of val progress bar updates in TQDMProgressBar (#12563)
  • Avoid calling average_parameters multiple times per optimizer step (#12452)
  • Properly pass some Logger's parent's arguments to super().__init__() (#12609)
  • Fixed an issue where incorrect type warnings appear when the overridden LightningLite.run method accepts user-defined arguments (#12629)
  • Fixed rank_zero_only decorator in LSF environments (#12587)
  • Don't raise a warning when nn.Module is not saved under hparams (#12669)
  • Raise MisconfigurationException when the accelerator is available but the user passes invalid ([]/0/"0") values to the devices flag (#12708)
  • Support auto_select_gpus with the accelerator and devices API (#12608)

Contributors

@akihironitta @awaelchli @Borda @carmocca @kaushikb11 @krshrimali @mauvilsa @otaj @pre-commit-ci @rohitgr7 @semaphore-egg @tkonopka @wayi1

If we forgot someone due to not matching the commit email with the GitHub account, let us know :]