Releases: Lightning-AI/pytorch-lightning
Standard weekly patch release
[1.5.1] - 2021-11-09
Fixed
- Fixed `apply_to_collection(defaultdict)` (#10316)
- Fixed failure when `DataLoader(batch_size=None)` is passed (#10345)
- Fixed interception of `__init__` arguments for sub-classed DataLoader re-instantiation in Lite (#10334)
- Fixed issue with pickling `CSVLogger` after a call to `CSVLogger.save` (#10388)
- Fixed an import error being caused by `PostLocalSGD` when `torch.distributed` is not available (#10359)
- Fixed the logging with `on_step=True` in epoch-level hooks causing unintended side-effects. Logging with `on_step=True` in epoch-level hooks will now correctly raise an error (#10409)
- Fixed deadlocks for distributed training with `RichProgressBar` (#10428)
- Fixed an issue where the model wrapper in Lite converted non-floating point tensors to float (#10429)
- Fixed an issue with inferring the dataset type in fault-tolerant training (#10432)
- Fixed dataloader workers with `persistent_workers` being deleted on every iteration (#10434)
Contributors
@EspenHa @four4fish @peterdudfield @rohitgr7 @tchaton @kaushikb11 @awaelchli @Borda @carmocca
If we forgot someone due to not matching commit email with GitHub account, let us know :]
PyTorch Lightning 1.5: LightningLite, Fault-Tolerant Training, Loop Customization, Lightning Tutorials, LightningCLI v2, RichProgressBar, CheckpointIO Plugin, and Trainer Strategy Flag
The PyTorch Lightning team and its community are excited to announce Lightning 1.5, introducing support for LightningLite, Fault-tolerant Training, Loop Customization, Lightning Tutorials, LightningCLI V2, RichProgressBar, CheckpointIO Plugin, Trainer Strategy flag, and more!
Highlights
Lightning 1.5 marks our biggest release yet. Over 60 contributors have worked on features, bugfixes and documentation improvements for a total of 640 commits since v1.4. Here are some highlights:
Fault-tolerant Training
Fault-tolerant Training is a new internal mechanism that enables PyTorch Lightning to recover from a hardware or software failure. This is particularly interesting when training in the cloud on preemptible instances, which can shut down at any time. When a Lightning experiment exits unexpectedly, a temporary checkpoint is saved that contains the exact state of all loops and the model. With this new experimental feature, you will be able to restore your training mid-epoch on the exact batch and continue training as if it had never been interrupted.
```bash
PL_FAULT_TOLERANT_TRAINING=1 python train.py
```
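If you prefer to flip the switch from inside your script rather than on the command line, a minimal sketch (equivalent to the shell command above) looks like this:

```python
import os

# enable the experimental fault-tolerant mode before creating any Lightning objects
os.environ["PL_FAULT_TOLERANT_TRAINING"] = "1"

from pytorch_lightning import Trainer

trainer = Trainer()  # an interrupted run can now be restored mid-epoch from the temporary checkpoint
```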
LightningLite
LightningLite enables pure PyTorch users to scale their existing code to any kind of hardware while retaining full control over their own loops and optimization logic.
With just a few lines of code and no large refactoring, you get support for multi-device, multi-node, running on different accelerators (CPU, GPU, TPU), native automatic mixed precision (half and `bfloat16`), and double precision, in just a few seconds. And no special launcher is required! Check out our documentation to find out how you can get one step closer to boilerplate-free research!
```python
class Lite(LightningLite):
    def run(self):
        # Let Lite setup your dataloader(s)
        train_loader = self.setup_dataloaders(torch.utils.data.DataLoader(...))

        model = Net()  # .to() not needed
        optimizer = optim.Adam(model.parameters())
        # Let Lite setup your model and optimizer
        model, optimizer = self.setup(model, optimizer)

        for epoch in range(5):
            for data, target in train_loader:
                optimizer.zero_grad()
                output = model(data)  # data is already on the device
                loss = F.nll_loss(output, target)
                self.backward(loss)  # instead of loss.backward()
                optimizer.step()


Lite(accelerator="gpu", devices="auto").run()
```
Loop Customization
The new Loop API lets advanced users swap out the default gradient descent optimization loop at the core of Lightning with a different optimization paradigm. This is part of our effort to make Lightning the simplest, most flexible framework to take any kind of deep learning research to production.
Read our comprehensive introduction to loops
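As a rough sketch of the API, a custom loop subclasses the `Loop` base class and implements its three core pieces, `done`, `reset` and `advance`; how the loop is attached to the `Trainer` depends on which built-in loop you replace, as described in the introduction linked above:

```python
from pytorch_lightning.loops import Loop


class MyFancyLoop(Loop):
    @property
    def done(self):
        # return True once your custom optimization paradigm has finished
        ...

    def reset(self):
        # reset any internal state so the loop can be run again
        ...

    def advance(self, *args, **kwargs):
        # run a single step of the custom optimization logic
        ...
```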
New Rich Progress Bar
We integrated with Rich and created a new and improved progress bar for Lightning.
Try it out:
```bash
pip install rich
```

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import RichProgressBar

trainer = Trainer(callbacks=[RichProgressBar()])
```
New Trainer Arguments: Strategy and Devices
With the new `strategy` and `devices` arguments in the `Trainer`, it is now easier to switch from one type of hardware to another.
| Before | After |
|---|---|
| `Trainer(accelerator="ddp", gpus=2)` | `Trainer(accelerator="gpu", devices=2, strategy="ddp")` |
| `Trainer(accelerator="ddp_cpu", num_processes=2)` | `Trainer(accelerator="cpu", devices=2, strategy="ddp")` |
| `Trainer(accelerator="tpu_spawn", tpu_cores=8)` | `Trainer(accelerator="tpu", devices=8)` |
The new `devices` argument is now agnostic to all accelerators, but the previous arguments `gpus`, `tpu_cores`, and `ipus` are still available and work the same as before. In addition, it is now also possible to set `devices="auto"` or `accelerator="auto"` to select the best accelerator available on the hardware.
```python
from pytorch_lightning import Trainer

trainer = Trainer(accelerator="auto", devices="auto")
```
LightningCLI V2
This release adds support for running not just `Trainer.fit` but any of the `Trainer` entry points!
```bash
python script.py fit
python script.py test
```
LightningCLI now supports registries for callbacks, optimizers, learning rate schedulers, LightningModules and LightningDataModules. This greatly improves the command-line experience, as only the class names and their arguments are required:
```bash
python script.py \
    --trainer.callbacks=EarlyStopping \
    --trainer.callbacks.patience=5 \
    --trainer.callbacks=LearningRateMonitor \
    --trainer.callbacks.logging_interval=epoch \
    --optimizer=Adam \
    --optimizer.lr=0.01 \
    --lr_scheduler=OneCycleLR \
    --lr_scheduler.anneal_strategy=linear
```
We've also added support for a manual mode where the CLI takes care of the instantiation, but you have control over the `Trainer` calls:
```python
cli = LightningCLI(MyModel, run=False)
cli.trainer.fit(cli.model)
```
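For reference, a minimal sketch of what the `script.py` used above might look like; `MyModel` and `MyDataModule` are placeholder LightningModule/LightningDataModule classes, and the import path is the 1.5 location of `LightningCLI`:

```python
# script.py (hypothetical) -- exposes the fit/validate/test/predict subcommands
from pytorch_lightning.utilities.cli import LightningCLI

from my_project import MyDataModule, MyModel  # placeholder imports

if __name__ == "__main__":
    LightningCLI(MyModel, MyDataModule)
```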
CheckpointIO Plugins
As part of our commitment to extensibility, we have abstracted the checkpointing logic into a CheckpointIO plugin. This enables users to adapt Lightning to their own infrastructure.
```python
from pytorch_lightning.plugins import CheckpointIO

class CustomCheckpointIO(CheckpointIO):
    def save_checkpoint(self, checkpoint, path):
        # put all logic related to saving a checkpoint here
        ...

    def load_checkpoint(self, path):
        # put all logic related to loading a checkpoint here
        ...

    def remove_checkpoint(self, path):
        # put all logic related to deleting a checkpoint here
        ...
```
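A minimal usage sketch, assuming the plugin above has its method bodies filled in; the custom plugin is passed to the `Trainer` through the `plugins` argument:

```python
from pytorch_lightning import Trainer

# the Trainer delegates checkpoint saving, loading and removal to the plugin
trainer = Trainer(plugins=[CustomCheckpointIO()])
```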
BFloat16 Support
PyTorch 1.10 introduces native Automatic Mixed Precision (AMP) support for `torch.bfloat16` on CPU (it was already supported for TPUs), enabling higher performance compared with `torch.float16`. Switch to bfloat16 training by setting the argument:
```python
from pytorch_lightning import Trainer

trainer = Trainer(precision="bf16")
```
Enable Auto Parameters Tying
It is pretty common to share parameters within a model. However, TPUs don't retain shared parameters once the model is moved to the device. Lightning now supports automatic detection and re-assignment of shared parameters to alleviate this problem on TPUs.
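To make the scenario concrete, here is a hypothetical module with shared parameters of the kind this feature targets (names and sizes are illustrative only):

```python
import torch.nn as nn
from pytorch_lightning import LightningModule


class TiedEmbeddingModel(LightningModule):
    def __init__(self, vocab_size=1000, hidden=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.output = nn.Linear(hidden, vocab_size, bias=False)
        # the output projection shares its weight matrix with the embedding;
        # Lightning now detects and re-ties this automatically after the move to a TPU device
        self.output.weight = self.embedding.weight
```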
Infinite Training
Infinite training is now supported by setting `Trainer(max_epochs=-1)` for an unlimited number of epochs, or `Trainer(max_steps=-1)` for an endless epoch.

Note: you will want to avoid logging with `on_epoch=True` in case of `max_steps=-1`.
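In the same style as the other `Trainer` snippets in these notes:

```python
from pytorch_lightning import Trainer

trainer = Trainer(max_epochs=-1)  # unlimited number of epochs
# or
trainer = Trainer(max_steps=-1)   # a single, endless epoch
```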
DeepSpeed Stage 1
DeepSpeed is a deep learning training optimization library, providing the means to train massive billion parameter models at scale. Lightning now also supports the DeepSpeed ZeRO Stage 1 protocol that partitions your optimizer states across your GPUs to reduce memory.
```python
from pytorch_lightning import Trainer

trainer = Trainer(gpus=4, strategy="deepspeed_stage_1", precision=16)
trainer.fit(model)
```
For even more memory savings and model sharding advice, check out stage 2 & 3 as well in our multi-GPU docs.
Gradient Clipping Customization
By overriding the `LightningModule.configure_gradient_clipping` hook, you can customize gradient clipping to your needs:
```python
# Perform gradient clipping on gradients associated with discriminator (optimizer_idx=1) in GAN
def configure_gradient_clipping(
    self,
    optimizer,
    optimizer_idx,
    gradient_clip_val,
    gradient_clip_algorithm,
):
    if optimizer_idx == 1:
        # Lightning will handle the gradient clipping
        self.clip_gradients(
            optimizer,
            gradient_clip_val=gradient_clip_val,
            gradient_clip_algorithm=gradient_clip_algorithm,
        )
```
This means you can now implement state-of-the-art clipping algorithms with Lightning!
Determinism
Added support for `torch.use_deterministic_algorithms`. Read more about how it works here. You can enable it by setting:
```python
from pytorch_lightning import Trainer

trainer = Trainer(deterministic=True)
```
Anomaly Detection
Lightning makes it easier to debug your code, so we've added support for `torch.autograd.set_detect_anomaly`. With this, PyTorch detects numerical anomalies like NaN or inf during forward and backward. Read more about anomaly detection here.
```python
from pytorch_lightning import Trainer

trainer = Trainer(detect_anomaly=True)
```
DDP Debugging Improvements
Are you having a hard time debugging DDP on your remote machine? Now you can de...
Standard weekly patch release
[1.4.9] - 2021-09-30
- Moved the gradient unscaling in `NativeMixedPrecisionPlugin` from `pre_optimizer_step` to `post_backward` (#9606)
- Fixed gradient unscaling being called too late, causing gradient clipping and gradient norm tracking to be applied incorrectly (#9606)
- Fixed `lr_find` to generate same results on multiple calls (#9704)
- Fixed `reset` metrics on validation epoch end (#9717)
- Fixed input validation for `gradient_clip_val`, `gradient_clip_algorithm`, `track_grad_norm` and `terminate_on_nan` Trainer arguments (#9595)
- Reset metrics before each task starts (#9410)
Contributors
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.4.8] - 2021-09-22
- Fixed error reporting in DDP process reconciliation when processes are launched by an external agent (#9389)
- Added the `PL_RECONCILE_PROCESS` environment variable to enable process reconciliation regardless of cluster environment settings (#9389)
- Fixed `add_argparse_args` raising `TypeError` when args are typed as `typing.Generic` in Python 3.6 (#9554)
- Fixed back-compatibility for saving hyperparameters from a single container and inferring its argument name by reverting #9125 (#9642)
Contributors
@ananthsub @akihironitta @awaelchli @carmocca @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.4.7] - 2021-09-14
- Fixed logging of nan parameters (#9364)
- Fixed `replace_sampler` missing the batch size under specific conditions (#9367)
- Pass init args to `ShardedDataParallel` (#9483)
- Fixed collision of user argument when using ShardedDDP (#9512)
- Fixed DeepSpeed crash for RNNs (#9489)
Contributors
@asanakoy @awaelchli @borisdayma @carmocca @guotuofeng @justusschock @kaushikb11 @rohitgr7 @SeanNaren
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.4.6] - 2021-09-10
- Fixed an issue with export to ONNX format when a model has multiple inputs (#8800)
- Removed deprecation warnings being called for `on_{task}_dataloader` (#9279)
- Fixed save/load/resume from checkpoint for DeepSpeed Plugin (#8397, #8644, #8627)
- Fixed `EarlyStopping` running on train epoch end when `check_val_every_n_epoch>1` is set (#9156)
- Fixed an issue with logger outputs not being finalized correctly after prediction runs (#8333)
- Fixed the Apex and DeepSpeed plugin closure running after the `on_before_optimizer_step` hook (#9288)
- Fixed the Native AMP plugin closure not running with manual optimization (#9288)
- Fixed a bug where data-loading functions were not getting the correct running stage passed (#8858)
- Fixed intra-epoch evaluation outputs staying in memory when the respective `*_epoch_end` hook wasn't overridden (#9261)
- Fixed error handling in DDP process reconciliation when `_sync_dir` was not initialized (#9267)
- Fixed PyTorch Profiler not being enabled for manual optimization (#9316)
- Fixed inspection of other args when a container is specified in `save_hyperparameters` (#9125)
- Fixed signature of `Timer.on_train_epoch_end` and `StochasticWeightAveraging.on_train_epoch_end` to prevent unwanted deprecation warnings (#9347)
Contributors
@ananthsub @awaelchli @Borda @four4fish @justusschock @kaushikb11 @s-rog @SeanNaren @tangbinh @tchaton @xerus
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.4.5] - 2021-08-31
- Fixed reduction using `self.log(sync_dict=True, reduce_fx={mean,max})` (#9142)
- Fixed not setting a default value for `max_epochs` if `max_time` was specified on the `Trainer` constructor (#9072)
- Fixed `CometLogger` modifying the metrics in place; it now creates a copy of the metrics before performing any operations (#9150)
- Fixed `DDP` "CUDA error: initialization error" due to a `copy` instead of `deepcopy` on `ResultCollection` (#9239)
Contributors
@ananthsub @bamblebam @carmocca @daniellepintz @ethanwharris @kaushikb11 @sohamtiwari3120 @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.4.4] - 2021-08-24
- Fixed a bug in the binary search mode of auto batch size scaling where an exception was raised if the first trainer run resulted in OOM (#8954)
- Fixed a bug causing logging with `log_gpu_memory='min_max'` not to work (#9013)
Contributors
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.4.3] - 2021-08-17
- Fixed plateau scheduler stepping on incomplete epoch (#8861)
- Fixed infinite loop with `CycleIterator` and multiple loaders (#8889)
- Fixed `StochasticWeightAveraging` with a list of learning rates not applying them to each param group (#8747)
- Restore original loaders if replaced by entrypoint (#8885)
- Fixed lost reference to `_Metadata` object in `ResultMetricCollection` (#8932)
- Ensure the existence of `DDPPlugin._sync_dir` in `reconciliate_processes` (#8939)
Contributors
@awaelchli @carmocca @justusschock @tchaton @yifuwang
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.4.2] - 2021-08-10
- Fixed recursive call for `apply_to_collection(include_none=False)` (#8719)
- Fixed truncated backprop through time enablement when set as a property on the LightningModule and not the Trainer (#8804)
- Fixed comments and exception message for `metrics_to_scalars` (#8782)
- Fixed a typo in the `LightningLoggerBase.after_save_checkpoint` docstring (#8737)
Contributors
@Aiden-Jeon @ananthsub @awaelchli @edward-io
If we forgot someone due to not matching commit email with GitHub account, let us know :]