Releases: Lightning-AI/pytorch-lightning
PyTorch Lightning 1.6: Support Intel's Habana Accelerator, New efficient DDP strategy (Bagua), Manual Fault-tolerance, Stability and Reliability.
The core team is excited to announce the PyTorch Lightning 1.6 release ⚡
Highlights
PyTorch Lightning 1.6 is the work of 99 contributors who have worked on features, bug-fixes, and documentation for a total of over 750 commits since 1.5. This is our most active release yet. Here are some highlights:
Introducing Intel's Habana Accelerator
Lightning 1.6 now supports the Habana® framework, which includes Gaudi® AI training processors. Gaudi's heterogeneous architecture includes a cluster of fully programmable Tensor Processing Cores (TPC) and a configurable Matrix Math engine, along with the associated development tools and libraries.
You can leverage the Habana hardware to accelerate your Deep Learning training workloads simply by passing:
trainer = pl.Trainer(accelerator="hpu")
# single Gaudi training
trainer = pl.Trainer(accelerator="hpu", devices=1)
# distributed training with 8 Gaudi
trainer = pl.Trainer(accelerator="hpu", devices=8)
The Bagua Strategy
The Bagua Strategy wraps Bagua, a deep learning training acceleration framework that supports multiple advanced distributed training algorithms with state-of-the-art system relaxation techniques. Enabling Bagua, which can be considerably faster than vanilla PyTorch DDP, is as simple as:
trainer = pl.Trainer(strategy="bagua")
# or to choose a custom algorithm
trainer = pl.Trainer(strategy=BaguaStrategy(algorithm="gradient_allreduce"))  # default
Towards stable Accelerator, Strategy, and Plugin APIs
The `Accelerator`, `Strategy`, and `Plugin` APIs are a core part of PyTorch Lightning. They're where all the distributed boilerplate lives, and we're constantly working to improve both them and the overall PyTorch Lightning platform experience.
In this release, we've made some large changes to achieve that goal. Not to worry, though! The only users affected by these changes are those who use custom implementations of `Accelerator` and `Strategy` (`TrainingTypePlugin`) as well as certain `Plugin`s. In particular, we want to highlight the following changes:
- All `TrainingTypePlugin`s have been renamed to `Strategy` (#11120). `Strategy` is a more appropriate name because it encompasses more than simply training communication. This change is now aligned with the changes we implemented in 1.5, which introduced the new `strategy` and `devices` flags to the Trainer.

  # Before
  from pytorch_lightning.plugins import DDPPlugin

  # New
  from pytorch_lightning.strategies import DDPStrategy

- The `Accelerator` and `PrecisionPlugin` have moved into `Strategy`. All strategies now take an optional parameter `accelerator` and `precision_plugin` (#11022, #10570).
- Custom `Accelerator` implementations must now implement two new abstract methods: `is_available()` (#11797) and `auto_device_count()` (#10222). The latter determines how many devices get used by default when specifying `Trainer(accelerator=..., devices="auto")`; see the sketch after this list.
- We redesigned the process creation for spawn-based strategies such as `DDPSpawnStrategy` and `TPUSpawnStrategy` (#10896). All spawn-based strategies now spawn processes immediately upon calling `Trainer.{fit,validate,test,predict}`, which means the hooks/callbacks `prepare_data`, `setup`, `configure_sharded_model` and `teardown` all run under an initialized process group. These changes align the spawn-based strategies with their non-spawn counterparts (such as `DDPStrategy`).
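As a rough illustration of the custom-Accelerator change above, here is a minimal sketch of the two new methods. It assumes the 1.6 `pytorch_lightning.accelerators.Accelerator` base class and uses CUDA queries as stand-in device checks; a real accelerator must also implement the remaining abstract methods of the base class:

import torch
from pytorch_lightning.accelerators import Accelerator


class MyAccelerator(Accelerator):
    @staticmethod
    def is_available() -> bool:
        # Report whether this accelerator's hardware can be used on the current machine.
        return torch.cuda.is_available()

    @staticmethod
    def auto_device_count() -> int:
        # How many devices Trainer(accelerator=..., devices="auto") should pick by default.
        return torch.cuda.device_count()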
We've also exposed the process group backend for use. For example, you can now easily enable `fairring` like this:
# Explicitly specify the process group backend if you choose to
ddp = pl.strategies.DDPStrategy(process_group_backend="fairring")
trainer = Trainer(strategy=ddp, accelerator="gpu", devices=8)
In a similar fashion, if you have `torch>=1.11` installed, you can enable DDP static graph to apply special runtime optimizations:
trainer = Trainer(devices=4, strategy=DDPStrategy(static_graph=True))
LightningCLI improvements
In the previous release, we added shorthand notation support for registered components. In this release, we added a flag to automatically register all available components:
from pytorch_lightning.utilities.cli import LightningCLI
LightningCLI(auto_registry=True)
We have also added support for the `ReduceLROnPlateau` scheduler with shorthand notation:
$ python script.py fit --optimizer=Adam --lr_scheduler=ReduceLROnPlateau --lr_scheduler.monitor=metric_to_track
If you need to customize the learning rate scheduler configuration, you can do so by overriding:
class MyLightningCLI(LightningCLI):
    @staticmethod
    def configure_optimizers(lightning_module, optimizer, lr_scheduler=None):
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": lr_scheduler, ...}}
Finally, loggers are also now configurable with shorthand:
$ python script.py fit --trainer.logger=WandbLogger --trainer.logger.name="my_lightning_run"
Control SLURM's re-queueing
We've added the ability to turn the automatic resubmission on or off when a job gets interrupted by the SLURM controller (via signal handling). Users who prefer to let their code handle the resubmission (for example, when submitit is used) can now pass:
from pytorch_lightning.plugins.environments import SLURMEnvironment
trainer = pl.Trainer(plugins=SLURMEnvironment(auto_requeue=False))
Fault-tolerance improvements
Fault-tolerant training under manual optimization now tracks optimization progress. We also changed the graceful exit signal from `SIGUSR1` to `SIGTERM` for better support inside cloud instances.
An additional feature we're excited to announce is support for consecutive `trainer.fit()` calls:
trainer = pl.Trainer(max_epochs=2)
trainer.fit(model)
# now, run 2 more epochs
trainer.fit_loop.max_epochs = 4
trainer.fit(model)
Loop customization improvements
The `Loop`'s state is now included as part of the checkpoints saved by the library. This enables finer restoration of custom loops.
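For illustration, a custom loop might persist its own counters alongside Lightning's loop state. This is a minimal sketch that assumes the 1.6 `Loop.on_save_checkpoint`/`on_load_checkpoint` hooks; treat it as an approximation rather than the canonical recipe:

import pytorch_lightning as pl


class MyStatefulLoop(pl.loops.TrainingEpochLoop):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.num_custom_steps = 0  # loop-local state we want restored on resume

    def on_save_checkpoint(self):
        # Assumed hook: return extra state to embed in the checkpoint.
        return {"num_custom_steps": self.num_custom_steps}

    def on_load_checkpoint(self, state_dict):
        # Assumed hook: restore the extra state when resuming.
        self.num_custom_steps = state_dict["num_custom_steps"]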
We've also made it easier to replace Lightning's loops with your own. For example:
class MyCustomLoop(pl.loops.TrainingEpochLoop):
    ...
trainer = pl.Trainer(...)
trainer.fit_loop.replace(epoch_loop=MyCustomLoop)
# Trainer runs the fit loop with your new epoch loop!
trainer.fit(model)
Data-Loading improvements
In previous versions, Lightning required that the `DataLoader` instance set its input arguments as instance attributes. This meant that custom `DataLoader`s also had this hidden requirement. In this release, we do this automatically for the user, easing the passing of custom loaders:
class MyDataLoader(torch.utils.data.DataLoader):
    def __init__(self, a=123, *args, **kwargs):
        # this was required before, but it is no longer necessary:
        # self.a = a
        super().__init__(*args, **kwargs)

trainer.fit(model, train_dataloader=MyDataLoader())
As of this release, Lightning no longer pre-fetches 1 extra batch if it doesn't need to. Previously, doing so would conflict with the internal pre-fetching done by optimized data loaders such as FFCV's. You can now define your own pre-fetching value like this:
class MyCustomLoop(pl.loops.FitLoop):
    @property
    def prefetch_batches(self):
        return 7  # lucky number 7

trainer = pl.Trainer(...)
trainer.fit_loop = MyCustomLoop(min_epochs=trainer.min_epochs, max_epochs=trainer.max_epochs)
New Hooks
LightningModule.lr_scheduler_step
Lightning now allows the use of custom learning rate schedulers that aren't natively available in PyTorch. A great example of this is timm's schedulers. When using custom learning rate schedulers relying on an API other than PyTorch's, you can now define `LightningModule.lr_scheduler_step` with your desired logic.
from timm.scheduler import TanhLRScheduler
class MyLightningModule(pl.LightningModule):
    def configure_optimizers(self):
        ...
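Building on the snippet above, here is a rough sketch of how the new hook can be wired up. It assumes timm's `TanhLRScheduler` API and the 1.6 `lr_scheduler_step(self, scheduler, optimizer_idx, metric)` signature, so verify both against the respective docs:

import torch
import pytorch_lightning as pl
from timm.scheduler import TanhLRScheduler


class MyLightningModule(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=0.05)
        # timm scheduler: t_initial is the number of epochs over which to anneal
        scheduler = TanhLRScheduler(optimizer, t_initial=50)
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": scheduler, "interval": "epoch"}}

    def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
        # timm schedulers are stepped with the epoch index rather than a bare .step()
        scheduler.step(epoch=self.current_epoch)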
Standard weekly patch release
[1.5.10] - 2022-02-08
Fixed
- Fixed an issue to avoid the validation loop running on restart (#11552)
- The Rich progress bar now correctly shows the `on_epoch` logged values on train epoch end (#11689)
- Fixed an issue to make the `step` argument in `WandbLogger.log_image` work (#11716)
- Fixed `restore_optimizers` for mapping states (#11757)
- With `DPStrategy`, the batch is not explicitly moved to the device (#11780)
- Fixed an issue to avoid the val bar disappearing after `trainer.validate()` (#11700)
- Fixed supporting remote filesystems with `Trainer.weights_save_path` for fault-tolerant training (#11776)
- Fixed the check for available modules (#11526)
- Fixed a bug where the path for "last" checkpoints was not getting saved correctly, which caused newer runs to not remove the previous "last" checkpoint (#11481)
- Fixed a bug where the path for best checkpoints was not getting saved correctly when no metric was monitored, which caused newer runs to not use the best checkpoint (#11481)
Contributors
@ananthsub @Borda @circlecrystal @NathanGodey @nithinraok @rohitgr7
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.9] - 2022-01-20
Fixed
- Pinned sphinx-autodoc-typehints to <v1.15 (#11400)
- Skipped testing with PyTorch 1.7 and Python 3.9 on Ubuntu (#11217)
- Fixed type promotion when tensors of higher category than float are logged (#11401)
- Fixed the format of the configuration saved automatically by the CLI's `SaveConfigCallback` (#11532)
Changed
- Changed `LSFEnvironment` to use the `LSB_DJOB_RANKFILE` environment variable instead of `LSB_HOSTS` for determining node rank and main address (#10825)
- Disabled sampler replacement when using `IterableDataset` (#11507)
Contributors
@ajtritt @akihironitta @carmocca @rohitgr7
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.8] - 2022-01-05
Fixed
- Fixed `LightningCLI` race condition while saving the config (#11199)
- Fixed the default value used with `log(reduce_fx=min|max)` (#11310)
- Fixed data fetcher selection (#11294)
- Fixed a race condition that could result in incorrect (zero) values being observed in prediction writer callbacks (#11288)
- Fixed dataloaders not getting reloaded the correct number of times when setting `reload_dataloaders_every_n_epochs` and `check_val_every_n_epoch` (#10948)
Contributors
@adamviola @akihironitta @awaelchli @Borda @carmocca @edpizzi
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.7] - 2021-12-21
Fixed
- Fixed `NeptuneLogger` when using DDP (#11030)
- Fixed a bug to disable logging hyperparameters in the logger if there are no hparams (#11105)
- Avoid the deprecated `onnx.export(example_outputs=...)` in torch 1.10 (#11116)
- Fixed an issue when torch-scripting a `LightningModule` after training with `Trainer(sync_batchnorm=True)` (#11078)
- Fixed an `AttributeError` occurring when using a `CombinedLoader` (multiple dataloaders) for prediction (#11111)
- Fixed a bug where `Trainer(track_grad_norm=..., logger=False)` would fail (#11114)
- Fixed an incorrect warning being produced by the model summary when using `bf16` precision on CPU (#11161)
Changed
- DeepSpeed does not require lightning module zero 3 partitioning (#10655)
- The `ModelCheckpoint` callback now saves and restores the attributes `best_k_models`, `kth_best_model_path`, `kth_value`, and `last_model_path` (#10995)
Contributors
@awaelchli @borchero @carmocca @guyang3532 @kaushikb11 @ORippler @Raalsky @rohitgr7 @SeanNaren
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.6] - 2021-12-15
Fixed
- Fixed a bug where the DeepSpeedPlugin arguments `cpu_checkpointing` and `contiguous_memory_optimization` were not being forwarded to DeepSpeed correctly (#10874)
- Fixed an issue with `NeptuneLogger` causing checkpoints to be uploaded with a duplicated file extension (#11015)
- Fixed support for logging within callbacks returned from `LightningModule` (#10991)
- Fixed running the sanity check with `RichProgressBar` (#10913)
- Fixed support for `CombinedLoader` while checking for warnings raised with eval dataloaders (#10994)
- The TQDM progress bar now correctly shows the `on_epoch` logged values on train epoch end (#11069)
- Fixed a bug where TQDM updated the training progress bar during `trainer.validate` (#11069)
Contributors
@carmocca @jona-0 @kaushikb11 @Raalsky @rohitgr7
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.5] - 2021-12-07
Fixed
- Disabled batch_size extraction for torchmetric instances because they accumulate the metrics internally (#10815)
- Fixed an issue with `SignalConnector` not restoring the default signal handlers on teardown when running on SLURM or with fault-tolerant training enabled (#10611)
- Fixed the `SignalConnector._has_already_handler` check for callable type (#10483)
- Fixed an issue to return the results for each dataloader separately instead of duplicating them for each (#10810)
- Improved the exception message if the `rich` version is less than `10.2.2` (#10839)
- Fixed uploading the best model checkpoint in `NeptuneLogger` (#10369)
- Fixed early schedule reset logic in the PyTorch profiler that was causing a data leak (#10837)
- Fixed a bug that caused incorrect batch indices to be passed to the `BasePredictionWriter` hooks when using a dataloader with `num_workers > 0` (#10870)
- Fixed an issue with item assignment on the logger on rank > 0 for loggers that support it (#10917)
- Fixed importing `torch_xla.debug` for `torch-xla<1.8` (#10836)
- Fixed an issue with `DDPSpawnPlugin` and related plugins leaving a temporary checkpoint behind (#10934)
- Fixed a `TypeError` occurring in the `SignalConnector.teardown()` method (#10961)
Contributors
@awaelchli @carmocca @four4fish @kaushikb11 @lucmos @mauvilsa @Raalsky @rohitgr7
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.4] - 2021-11-30
Fixed
- Fixed support for `--key.help=class` with the `LightningCLI` (#10767)
- Fixed `_compare_version` for Python packages (#10762)
- Fixed the TensorBoardLogger `SummaryWriter` not being closed before spawning the processes (#10777)
- Fixed a consolidation error in Lite when attempting to save the state dict of a sharded optimizer (#10746)
- Fixed the default logging level for batch hooks associated with training from `on_step=False, on_epoch=True` to `on_step=True, on_epoch=False` (#10756)
Removed
Contributors
@awaelchli @carmocca @kaushikb11 @rohitgr7 @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.3] - 2021-11-24
Fixed
- Fixed `ShardedTensor` state dict hook registration to check if torch distributed is available (#10621)
- Fixed an issue with `self.log` not respecting a tensor's `dtype` when applying computations (#10076)
- Fixed LightningLite `_wrap_init` popping nonexistent keys from DataLoader signature parameters (#10613)
- Fixed signals being registered within threads (#10610)
- Fixed an issue that caused Lightning to extract the batch size even though it was set by the user in `LightningModule.log` (#10408)
- Fixed `Trainer(move_metrics_to_cpu=True)` not moving the evaluation logged results to CPU (#10631)
- Fixed the `{validation,test}_step` outputs getting moved to CPU with `Trainer(move_metrics_to_cpu=True)` (#10631)
- Fixed an issue with collecting logged test results with multiple dataloaders (#10522)
Contributors
@ananthsub @awaelchli @carmocca @jiwidi @kaushikb11 @qqueing @rohitgr7 @shabie @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.2] - 2021-11-16
Fixed
- Fixed `CombinedLoader` with `max_size_cycle` not receiving a `DistributedSampler` (#10374)
- Fixed an issue where class or init-only variables of dataclasses were passed to the dataclass constructor in `utilities.apply_to_collection` (#9702)
- Fixed `isinstance` not working with `init_meta_context`, materialized model not being moved to the device (#10493)
- Fixed an issue that prevented the Trainer from shutting down workers when execution is interrupted due to failure (#10463)
- Squeeze the early stopping monitor to remove empty tensor dimensions (#10461)
- Fixed sampler replacement logic with `overfit_batches` to only replace the sampler when `SequentialSampler` is not used (#10486)
- Fixed scripting causing false positive deprecation warnings (#10470, #10555)
- Do not fail if the batch size could not be inferred for logging when using DeepSpeed (#10438)
- Fixed propagation of device and dtype information to submodules of LightningLite when they inherit from `DeviceDtypeModuleMixin` (#10559)
Contributors
@a-gardner1 @awaelchli @carmocca @justusschock @Raahul-Singh @rohitgr7 @SeanNaren @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]