Releases: Lightning-AI/pytorch-lightning

Standard weekly patch release

09 Nov 19:57

[1.5.1] - 2021-11-09

Fixed

  • Fixed apply_to_collection(defaultdict) (#10316)
  • Fixed failure when DataLoader(batch_size=None) is passed (#10345)
  • Fixed interception of __init__ arguments for sub-classed DataLoader re-instantiation in Lite (#10334)
  • Fixed issue with pickling CSVLogger after a call to CSVLogger.save (#10388)
  • Fixed an import error being caused by PostLocalSGD when torch.distributed not available (#10359)
  • Fixed the logging with on_step=True in epoch-level hooks causing unintended side-effects. Logging with on_step=True in epoch-level hooks will now correctly raise an error (#10409)
  • Fixed deadlocks for distributed training with RichProgressBar (#10428)
  • Fixed an issue where the model wrapper in Lite converted non-floating point tensors to float (#10429)
  • Fixed an issue with inferring the dataset type in fault-tolerant training (#10432)
  • Fixed dataloader workers with persistent_workers being deleted on every iteration (#10434)

Contributors

@EspenHa @four4fish @peterdudfield @rohitgr7 @tchaton @kaushikb11 @awaelchli @Borda @carmocca

If we forgot someone due to not matching commit email with GitHub account, let us know :]

PyTorch Lightning 1.5: LightningLite, Fault-Tolerant Training, Loop Customization, Lightning Tutorials, LightningCLI v2, RichProgressBar, CheckpointIO Plugin, and Trainer Strategy Flag

02 Nov 18:58

The PyTorch Lightning team and its community are excited to announce Lightning 1.5, introducing support for LightningLite, Fault-tolerant Training, Loop Customization, Lightning Tutorials, LightningCLI V2, RichProgressBar, CheckpointIO Plugin, Trainer Strategy flag, and more!

Highlights

Lightning 1.5 marks our biggest release yet. Over 60 contributors have worked on features, bugfixes and documentation improvements for a total of 640 commits since v1.4. Here are some highlights:

Fault-tolerant Training

Fault-tolerant Training is a new internal mechanism that enables PyTorch Lightning to recover from a hardware or software failure. It is particularly useful when training in the cloud on preemptible instances, which can shut down at any time. When a Lightning experiment exits unexpectedly, a temporary checkpoint is saved containing the exact state of all loops and the model. With this new experimental feature, you can restore your training mid-epoch on the exact batch and continue training as if it had never been interrupted.

PL_FAULT_TOLERANT_TRAINING=1 python train.py

LightningLite

LightningLite enables pure PyTorch users to scale their existing code to any kind of hardware while retaining full control over their own loops and optimization logic.

With just a few lines of code and no major refactoring, you get support for multi-device and multi-node training, different accelerators (CPU, GPU, TPU), native automatic mixed precision (half and bfloat16), and double precision. No special launcher is required! Check out our documentation to find out how you can get one step closer to boilerplate-free research!

import torch
import torch.nn.functional as F
from torch import optim

from pytorch_lightning.lite import LightningLite

class Lite(LightningLite):
    def run(self):
        # Let Lite setup your dataloader(s)
        train_loader = self.setup_dataloaders(torch.utils.data.DataLoader(...))

        model = Net()  # .to() not needed
        optimizer = optim.Adam(model.parameters())
        # Let Lite setup your model and optimizer
        model, optimizer = self.setup(model, optimizer)

        for epoch in range(5):
            for data, target in train_loader:
                optimizer.zero_grad()
                output = model(data)  # data is already on the device
                loss = F.nll_loss(output, target)
                self.backward(loss)  # instead of loss.backward()
                optimizer.step()


Lite(accelerator="gpu", devices="auto").run()

Loop Customization

The new Loop API lets advanced users swap out the default gradient descent optimization loop at the core of Lightning with a different optimization paradigm. This is part of our effort to make Lightning the simplest, most flexible framework to take any kind of deep learning research to production.

Read our comprehensive introduction to loops
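
For a flavor of the API, here is a minimal sketch of a custom loop built on the Loop base class from pytorch_lightning.loops and its done/reset/advance interface; the CountingLoop name and its counting logic are illustrative, not part of the release:

from pytorch_lightning.loops import Loop

class CountingLoop(Loop):
    """A toy loop that runs `advance` a fixed number of times."""

    def __init__(self, max_iterations):
        super().__init__()
        self.max_iterations = max_iterations
        self.iteration = 0

    @property
    def done(self):
        # run() stops calling advance() once this returns True
        return self.iteration >= self.max_iterations

    def reset(self):
        self.iteration = 0

    def advance(self):
        # custom optimization logic would go here
        self.iteration += 1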

New Rich Progress Bar

We integrated with Rich and created a new and improved progress bar for Lightning.
Try it out by installing Rich first:

pip install rich

Then add the callback to the Trainer:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import RichProgressBar

trainer = Trainer(callbacks=[RichProgressBar()])

New Trainer Arguments: Strategy and Devices

With the new strategy and devices arguments in the Trainer, it is now easier to switch from one type of hardware to another.

  • Before: Trainer(accelerator="ddp", gpus=2)
    After: Trainer(accelerator="gpu", devices=2, strategy="ddp")
  • Before: Trainer(accelerator="ddp_cpu", num_processes=2)
    After: Trainer(accelerator="cpu", devices=2, strategy="ddp")
  • Before: Trainer(accelerator="tpu_spawn", tpu_cores=8)
    After: Trainer(accelerator="tpu", devices=8)

The new devices argument is agnostic to all accelerators, while the previous arguments gpus, tpu_cores, and ipus are still available and work the same as before. In addition, you can now set devices="auto" or accelerator="auto" to select the best option available on your hardware.

from pytorch_lightning import Trainer

trainer = Trainer(accelerator="auto", devices="auto")

LightningCLI V2

This release adds support for running not just Trainer.fit but any of the Trainer entry points!

python script.py fit
python script.py test

LightningCLI now supports registries for callbacks, optimizers, learning rate schedulers, LightningModules, and LightningDataModules. This greatly improves the command-line experience, as only the class names and their arguments are required:

python script.py \
    --trainer.callbacks=EarlyStopping \
    --trainer.callbacks.patience=5 \
    --trainer.callbacks.LearningRateMonitor \
    --trainer.callbacks.logging_interval=epoch \
    --optimizer=Adam \
    --optimizer.lr=0.01 \
    --lr_scheduler=OneCycleLR \
    --lr_scheduler.anneal_strategy=linear

We've also added support for a manual mode where the CLI takes care of the instantiation but you have control over the Trainer calls:

cli = LightningCLI(MyModel, run=False)
cli.trainer.fit(cli.model)

Try out LightningCLI!

CheckpointIO Plugins

As part of our commitment to extensibility, we have abstracted the checkpointing logic into a CheckpointIO plugin. This enables users to adapt Lightning to their own infrastructure.

from pytorch_lightning.plugins import CheckpointIO

class CustomCheckpointIO(CheckpointIO):

    def save_checkpoint(self, checkpoint, path):
        # put all logic related to saving a checkpoint here
        ...

    def load_checkpoint(self, path):
        # put all logic related to loading a checkpoint here
        ...

    def remove_checkpoint(self, path):
        # put all logic related to deleting a checkpoint here
        ...
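
The custom plugin can then be passed to the Trainer through the plugins argument:

from pytorch_lightning import Trainer

trainer = Trainer(plugins=[CustomCheckpointIO()])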

BFloat16 Support

PyTorch 1.10 introduces native Automatic Mixed Precision (AMP) support for torch.bfloat16 on CPU (it was already supported on TPUs), enabling higher performance compared with torch.float16. Switch to bfloat16 training by setting the argument:

from pytorch_lightning import Trainer

trainer = Trainer(precision="bf16")

Enable Auto Parameters Tying

It is common to share parameters within a model. However, TPUs do not retain shared parameters once the model is moved to the device. Lightning now automatically detects shared parameters and re-assigns them on TPU to alleviate this problem.
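
For illustration, here is a hypothetical module with tied weights, the pattern this feature now preserves on TPU (the TiedAutoEncoder name and its layers are our example, not from the release):

import torch.nn as nn
import pytorch_lightning as pl

class TiedAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(32, 32)
        self.decoder = nn.Linear(32, 32)
        # tie the decoder weights to the encoder weights: one shared Parameter
        self.decoder.weight = self.encoder.weight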

Infinite Training

Infinite training is now supported: set Trainer(max_epochs=-1) for an unlimited number of epochs, or Trainer(max_steps=-1) for an endless epoch.
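
For example:

from pytorch_lightning import Trainer

# train for an unlimited number of epochs
trainer = Trainer(max_epochs=-1)

# or run a single, endless epoch
trainer = Trainer(max_steps=-1)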

Note: avoid logging with on_epoch=True when training with max_steps=-1.

DeepSpeed Stage 1

DeepSpeed is a deep learning training optimization library that provides the means to train massive, billion-parameter models at scale. Lightning now also supports the DeepSpeed ZeRO Stage 1 protocol, which partitions your optimizer states across your GPUs to reduce memory usage.

from pytorch_lightning import Trainer

trainer = Trainer(gpus=4, strategy="deepspeed_stage_1", precision=16)
trainer.fit(model)

For even more memory savings and model sharding advice, check out stage 2 & 3 as well in our multi-GPU docs.

Gradient Clipping Customization

By overriding the LightningModule.configure_gradient_clipping hook, you can customize gradient clipping to your needs:

# Perform gradient clipping on gradients associated with discriminator (optimizer_idx=1) in GAN
def configure_gradient_clipping(
    self,
    optimizer,
    optimizer_idx,
    gradient_clip_val,
    gradient_clip_algorithm
):
    if optimizer_idx == 1:
        # Lightning will handle the gradient clipping
        self.clip_gradients(
            optimizer,
            gradient_clip_val=gradient_clip_val,
            gradient_clip_algorithm=gradient_clip_algorithm
        )

This means you can now implement state-of-the-art clipping algorithms with Lightning!
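
As an illustration, here is a hypothetical per-parameter scheme in the spirit of adaptive gradient clipping; the 0.01 threshold rule is our invention for the sketch, not a Lightning API:

import torch

def configure_gradient_clipping(
    self,
    optimizer,
    optimizer_idx,
    gradient_clip_val,
    gradient_clip_algorithm
):
    # clip each gradient to a fraction of its parameter's own norm
    for group in optimizer.param_groups:
        for p in group["params"]:
            if p.grad is not None:
                max_norm = 0.01 * float(p.detach().norm().clamp(min=1e-3))
                torch.nn.utils.clip_grad_norm_([p], max_norm)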

Determinism

Added support for torch.use_deterministic_algorithms. Read more about how it works in the PyTorch documentation. You can enable it by setting:

from pytorch_lightning import Trainer

trainer = Trainer(deterministic=True)

Anomaly Detection

Lightning makes it easier to debug your code, so we've added support for torch.set_detect_anomaly. With this, PyTorch detects numerical anomalies like NaN or inf during the forward and backward passes. Read more about anomaly detection in the PyTorch documentation.

from pytorch_lightning import Trainer

trainer = Trainer(detect_anomaly=True)

DDP Debugging Improvements

Are you having a hard time debugging DDP on your remote machine? Now you can de...


Standard weekly patch release

30 Sep 13:43

[1.4.9] - 2021-09-30

  • Moved the gradient unscaling in NativeMixedPrecisionPlugin from pre_optimizer_step to post_backward (#9606)
  • Fixed gradient unscaling being called too late, causing gradient clipping and gradient norm tracking to be applied incorrectly (#9606)
  • Fixed lr_find to generate same results on multiple calls (#9704)
  • Fixed reset metrics on validation epoch end (#9717)
  • Fixed input validation for gradient_clip_val, gradient_clip_algorithm, track_grad_norm and terminate_on_nan Trainer arguments (#9595)
  • Reset metrics before each task starts (#9410)

Contributors

@rohitgr7 @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

22 Sep 19:15

[1.4.8] - 2021-09-22

  • Fixed error reporting in DDP process reconciliation when processes are launched by an external agent (#9389)
  • Added PL_RECONCILE_PROCESS environment variable to enable process reconciliation regardless of cluster environment settings (#9389)
  • Fixed add_argparse_args raising TypeError when args are typed as typing.Generic in Python 3.6 (#9554)
  • Fixed back-compatibility for saving hyperparameters from a single container and inferring its argument name by reverting #9125 (#9642)

Contributors

@ananthsub @akihironitta @awaelchli @carmocca @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

15 Sep 09:28

[1.4.7] - 2021-09-14

  • Fixed logging of nan parameters (#9364)
  • Fixed replace_sampler missing the batch size under specific conditions (#9367)
  • Pass init args to ShardedDataParallel (#9483)
  • Fixed collision of user argument when using ShardedDDP (#9512)
  • Fixed DeepSpeed crash for RNNs (#9489)

Contributors

@asanakoy @awaelchli @borisdayma @carmocca @guotuofeng @justusschock @kaushikb11 @rohitgr7 @SeanNaren

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

10 Sep 16:24

[1.4.6] - 2021-09-10

  • Fixed an issue with export to the ONNX format when a model has multiple inputs (#8800)
  • Removed deprecation warnings being called for on_{task}_dataloader (#9279)
  • Fixed save/load/resume from checkpoint for DeepSpeed Plugin (#8397, #8644, #8627)
  • Fixed EarlyStopping running on train epoch end when check_val_every_n_epoch>1 is set (#9156)
  • Fixed an issue with logger outputs not being finalized correctly after prediction runs (#8333)
  • Fixed the Apex and DeepSpeed plugin closure running after the on_before_optimizer_step hook (#9288)
  • Fixed the Native AMP plugin closure not running with manual optimization (#9288)
  • Fixed a bug where data-loading functions were not getting the correct running stage passed (#8858)
  • Fixed intra-epoch evaluation outputs staying in memory when the respective *_epoch_end hook wasn't overridden (#9261)
  • Fixed error handling in DDP process reconciliation when _sync_dir was not initialized (#9267)
  • Fixed PyTorch Profiler not enabled for manual optimization (#9316)
  • Fixed inspection of other args when a container is specified in save_hyperparameters (#9125)
  • Fixed signature of Timer.on_train_epoch_end and StochasticWeightAveraging.on_train_epoch_end to prevent unwanted deprecation warnings (#9347)

Contributors

@ananthsub @awaelchli @Borda @four4fish @justusschock @kaushikb11 @s-rog @SeanNaren @tangbinh @tchaton @xerus

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

01 Sep 13:27

[1.4.5] - 2021-08-31

  • Fixed reduction using self.log(sync_dist=True, reduce_fx={mean,max}) (#9142)
  • Fixed not setting a default value for max_epochs if max_time was specified on the Trainer constructor (#9072)
  • Fixed the CometLogger so it no longer modifies the metrics in place; it now creates a copy of the metrics before performing any operations (#9150)
  • Fixed DDP "CUDA error: initialization error" due to a copy instead of deepcopy on ResultCollection (#9239)

Contributors

@ananthsub @bamblebam @carmocca @daniellepintz @ethanwharris @kaushikb11 @sohamtiwari3120 @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

24 Aug 15:10

[1.4.4] - 2021-08-24

  • Fixed a bug in the binary search mode of auto batch size scaling where an exception was raised if the first trainer run resulted in an OOM error (#8954)
  • Fixed a bug that caused logging with log_gpu_memory='min_max' to not work (#9013)

Contributors

@SkafteNicki @eladsegal

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

23 Aug 15:55

[1.4.3] - 2021-08-17

  • Fixed plateau scheduler stepping on incomplete epoch (#8861)
  • Fixed infinite loop with CycleIterator and multiple loaders (#8889)
  • Fixed StochasticWeightAveraging with a list of learning rates not applying them to each param group (#8747)
  • Restore original loaders if replaced by entrypoint (#8885)
  • Fixed lost reference to _Metadata object in ResultMetricCollection (#8932)
  • Ensure the existence of DDPPlugin._sync_dir in reconciliate_processes (#8939)

Contributors

@awaelchli @carmocca @justusschock @tchaton @yifuwang
If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

11 Aug 13:51

[1.4.2] - 2021-08-10

  • Fixed recursive call for apply_to_collection(include_none=False) (#8719)
  • Fixed truncated backprop through time enablement when set as a property on the LightningModule and not the Trainer (#8804)
  • Fixed comments and exception message for metrics_to_scalars (#8782)
  • Fixed typo error in LightningLoggerBase.after_save_checkpoint docstring (#8737)

Contributors

@Aiden-Jeon @ananthsub @awaelchli @edward-io
If we forgot someone due to not matching commit email with GitHub account, let us know :]