Releases: Lightning-AI/pytorch-lightning
Standard weekly patch release
[1.4.1] - 2021-08-03
- Fixed `trainer.fit_loop.split_idx` always returning `None` (#8601)
- Fixed references for `ResultCollection.extra` (#8622)
- Fixed reference issues during epoch end result collection (#8621)
- Fixed horovod auto-detection when horovod is not installed and the launcher is `mpirun` (#8610)
- Fixed an issue with `training_step` outputs not getting collected correctly for `training_epoch_end` (#8613)
- Fixed distributed types support for CPUs (#8667)
- Fixed a deadlock issue with DDP and torchelastic (#8655)
- Fixed `accelerator=ddp` choice for CPU (#8645)
Contributors
@awaelchli, @Borda, @carmocca, @kaushikb11, @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
TPU Pod Training, IPU Accelerator, DeepSpeed Infinity, Fully Sharded Data Parallel
Today we are excited to announce Lightning 1.4, introducing support for TPU pods, XLA profiling, IPUs, and new plugins to reach 10+ billion parameters, including DeepSpeed Infinity, Fully Sharded Data Parallel, and more!
https://devblog.pytorchlightning.ai/announcing-lightning-1-4-8cd20482aee9
[1.4.0] - 2021-07-27
Added
- Added `extract_batch_size` utility and corresponding tests to extract batch dimension from multiple batch types (#8357)
- Added support for named parameter groups in `LearningRateMonitor` (#7987)
- Added `dataclass` support for `pytorch_lightning.utilities.apply_to_collection` (#7935)
- Added support to `LightningModule.to_torchscript` for saving to custom filesystems with `fsspec` (#7617)
- Added `KubeflowEnvironment` for use with the `PyTorchJob` operator in Kubeflow
- Added LightningCLI support for config files on object stores (#7521)
- Added `ModelPruning(prune_on_train_epoch_end=True|False)` to choose when to apply pruning (#7704)
- Added support for checkpointing based on a provided time interval during training (#7515)
- Progress tracking
- Added support for passing a `LightningDataModule` positionally as the second argument to `trainer.{validate,test,predict}` (#7431)
- Added argument `trainer.predict(ckpt_path)` (#7430)
- Added `clip_grad_by_value` support for TPUs (#7025)
- Added support for passing any class to `is_overridden` (#7918)
- Added `sub_dir` parameter to `TensorBoardLogger` (#6195)
- Added correct `dataloader_idx` to batch transfer hooks (#6241)
- Added `include_none=bool` argument to `apply_to_collection` (#7769)
- Added `apply_to_collections` to apply a function to two zipped collections (#7769)
- Added `ddp_fully_sharded` support (#7487)
- Added `should_rank_save_checkpoint` property to Training Plugins (#7684)
- Added `log_grad_norm` hook to `LightningModule` to customize the logging of gradient norms (#7873)
- Added `save_config_filename` init argument to `LightningCLI` to ease resolving name conflicts (#7741)
- Added `save_config_overwrite` init argument to `LightningCLI` to ease overwriting existing config files (#8059)
- Added reset dataloader hooks to Training Plugins and Accelerators (#7861)
- Added trainer stage hooks for Training Plugins and Accelerators (#7864)
- Added the `on_before_optimizer_step` hook (#8048)
- Added IPU Accelerator (#7867)
- Fault-tolerant training
  - Added `{,load_}state_dict` to `ResultCollection` (#7948)
  - Added `{,load_}state_dict` to `Loops` (#8197)
  - Set `Loop.restarting=False` at the end of the first iteration (#8362)
  - Save the loops state with the checkpoint (opt-in) (#8362)
  - Save a checkpoint to restore the state on exception (opt-in) (#8362)
  - Added `state_dict` and `load_state_dict` utilities for `CombinedLoader` + utilities for dataloader (#8364)
- Added `rank_zero_only` to `LightningModule.log` function (#7966)
- Added `metric_attribute` to `LightningModule.log` function (#7966)
- Added a warning if `Trainer(log_every_n_steps)` is a value too high for the training dataloader (#7734)
- Added LightningCLI support for argument links applied on instantiation (#7895)
- Added LightningCLI support for configurable callbacks that should always be present (#7964)
- Added DeepSpeed Infinity Support, and updated to DeepSpeed 0.4.0 (#7234)
- Added support for `torch.nn.UninitializedParameter` in `ModelSummary` (#7642)
- Added support for `LightningModule.save_hyperparameters` when `LightningModule` is a dataclass (#7992)
- Added support for overriding `optimizer_zero_grad` and `optimizer_step` when using `accumulate_grad_batches` (#7980)
- Added `logger` boolean flag to `save_hyperparameters` (#7960)
- Added support for calling scripts using the module syntax (`python -m package.script`) (#8073)
- Added support for optimizers and learning rate schedulers to `LightningCLI` (#8093)
- Added XLA Profiler (#8014)
- Added `PrecisionPlugin.{pre,post}_backward` (#8328)
- Added `on_load_checkpoint` and `on_save_checkpoint` hooks to the `PrecisionPlugin` base class (#7831)
- Added `max_depth` parameter in `ModelSummary` (#8062)
- Added `XLAStatsMonitor` callback (#8235)
- Added `restore` function and `restarting` attribute to base `Loop` (#8247)
- Added `FastForwardSampler` and `CaptureIterableDataset` (#8307)
- Added support for `save_hyperparameters` in `LightningDataModule` (#3792)
- Added the `ModelCheckpoint(save_on_train_epoch_end)` argument to choose when to run the saving logic (#8389)
- Added `LSFEnvironment` for distributed training with the LSF resource manager `jsrun` (#5102)
- Added support for `accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto'` (#7808) (see the usage sketch after this list)
- Added `tpu_spawn_debug` to plugin registry (#7933)
- Enabled traditional/manual launching of DDP processes through `LOCAL_RANK` and `NODE_RANK` environment variable assignments (#7480)
- Added `quantize_on_fit_end` argument to `QuantizationAwareTraining` (#8464)
- Added experimental support for loop specialization (#8226)
- Added support for the `devices` flag to Trainer (#8440)
- Added private `prevent_trainer_and_dataloaders_deepcopy` context manager on the `LightningModule` (#8472)
- Added support for providing callables to the Lightning CLI instead of types (#8400)
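To show how a few of these additions fit together, here is a minimal usage sketch. It is not taken from the release notes: the `TinyModel` module and the random dataset are hypothetical placeholders. It combines the new `accelerator`/`devices` Trainer flags (#7808, #8440), `ModelCheckpoint(save_on_train_epoch_end=...)` (#8389), and the `ckpt_path` argument to `trainer.predict` (#7430).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class TinyModel(pl.LightningModule):
    """Hypothetical minimal module, only here to make the sketch self-contained."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def predict_step(self, batch, batch_idx, dataloader_idx=None):
        x, _ = batch
        return self.layer(x)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def make_loader():
    # Random data, purely for illustration.
    x, y = torch.randn(64, 32), torch.randint(0, 2, (64,))
    return DataLoader(TensorDataset(x, y), batch_size=16)


# New in 1.4: `save_on_train_epoch_end` chooses when the saving logic runs.
checkpoint_cb = ModelCheckpoint(save_on_train_epoch_end=True)

# New in 1.4: accelerator="auto" picks CPU/GPU/TPU/IPU automatically; `devices` selects how many.
trainer = pl.Trainer(accelerator="auto", devices=1, max_epochs=2, callbacks=[checkpoint_cb])

model = TinyModel()
trainer.fit(model, make_loader())

# New in 1.4: `ckpt_path` restores a specific checkpoint before running prediction.
predictions = trainer.predict(model, make_loader(), ckpt_path=checkpoint_cb.best_model_path)
```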
Changed
- Decoupled device parsing logic from Accelerator connector to Trainer (#8180)
- Changed the `Trainer`'s `checkpoint_callback` argument to allow only boolean values (#7539)
- Log epoch metrics before the `on_evaluation_end` hook (#7272)
- Explicitly disallow calling `self.log(on_epoch=False)` during epoch-only or single-call hooks (#7874)
- Changed these `Trainer` methods to be protected: `call_setup_hook`, `call_configure_sharded_model`, `pre_dispatch`, `dispatch`, `post_dispatch`, `call_teardown_hook`, `run_train`, `run_sanity_check`, `run_evaluate`, `run_evaluation`, `run_predict`, `track_output_for_epoch_end`
- Changed `metrics_to_scalars` to work with any collection or value (#7888)
- Changed `clip_grad_norm` to use `torch.nn.utils.clip_grad_norm_` (#7025)
- Validation is now always run inside the training epoch scope (#7357)
- `ModelCheckpoint` now runs at the end of the training epoch by default (#8389)
- `EarlyStopping` now runs at the end of the training epoch by default (#8286)
- Refactored Loops
  - Moved attributes `global_step`, `current_epoch`, `max/min_steps`, `max/min_epochs`, `batch_idx`, and `total_batch_idx` to TrainLoop (#7437)
  - Refactored result handling in training loop (#7506)
  - Moved attributes `hiddens` and `split_idx` to TrainLoop (#7507)
  - Refactored the logic around manual and automatic optimization inside the optimizer loop (#7526)
  - Simplified "should run validation" logic (#7682)
  - Simplified logic for updating the learning rate for schedulers (#7682)
  - Removed the `on_epoch` guard from the "should stop" validation check (#7701)
  - Refactored internal loop interface; added new classes `FitLoop`, `TrainingEpochLoop`, `TrainingBatchLoop` (#7871, #8077)
  - Removed `pytorch_lightning/trainer/training_loop.py` (#7985)
  - Refactored evaluation loop interface; added new classes `DataLoaderLoop`, `EvaluationLoop`, `EvaluationEpochLoop` (#7990, #8077)
  - Removed `pytorch_lightning/trainer/evaluation_loop.py` (#8056)
  - Restricted public access to several internal functions (#8024)
  - Refactored trainer `_run_*` functions and separate evaluation loops (#8065)
  - Refactored prediction loop interface; added new classes `PredictionLoop`, `PredictionEpochLoop` (#7700, #8077)
  - Removed `pytorch_lightning/trainer/predict_loop.py` (#8094)
  - Moved result teardown to the loops (#8245)
  - Improve `Loop` API to better handle children `state_dict` and `progress` (#8334)
- Refactored logging
  - Renamed and moved `core/step_result.py` to `trainer/connectors/logger_connector/result.py` (#7736)
  - Dramatically simplify the `LoggerConnector` (#7882)
  - `trainer.{logged,progress_bar,callback}_metrics` are now updated on-demand (#7882)
  - Completely overhaul the `Result` object in favor of `ResultMetric` (#7882)
  - Improve epoch-level reduction time and overall memory usage (#7882)
  - Allow passing `self.log(batch_size=...)` (#7891)
  - Each of the training loops now keeps its own results collection (#7891)
  - Remove `EpochResultStore` and `HookResultStore` in favor of `ResultCollection` (#7909)
  - Remove `MetricsHolder` (#7909)
- Moved `ignore_scalar_return_in_dp` warning suppression to the DataParallelPlugin class (#7421)
- Changed the behaviour when logging evaluation step metrics to no longer append `/epoch_*` to the metric name (#7351)
- Raised `ValueError` when a `None` value is `self.log`-ed (#7771)
- Changed `resolve_training_type_plugins` to allow setting `num_nodes` and `sync_batchnorm` from `Trainer` setting (#7026)
- Default `seed_everything(workers=True)` in the `LightningCLI` (#7504)
- Changed `model.state_dict()` in `CheckpointConnector` to allow `training_type_plugin` to customize the model's `state_dict()` (#7474)
- `MLflowLogger` now uses the env variable `MLFLOW_TRACKING_URI` as default tracking URI (#7457)
- Changed `Trainer` arg and functionality from `reload_dataloaders_every_epoch` to `reload_dataloaders_every_n_epochs` (#5043) (see the sketch after this list)
- Changed `WandbLogger(log_model={True/'all'})` to log models as artifacts (#6231)
- `MLFlowLogger` now accepts `run_name` as a constructor argument (#7622)
- Changed `teardown()` in `Accelerator` to allow `training_type_plugin` to customize `teardown` logic (#7579)
- `Trainer.fit` now raises an error when using manual optimization with unsupp...
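To illustrate two of the behaviour changes above, the sketch below passes an explicit `batch_size` to `self.log` (#7891) and uses the renamed `reload_dataloaders_every_n_epochs` Trainer argument (#5043). The module itself is a hypothetical placeholder, not something shipped with Lightning.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    """Hypothetical module showing the explicit `batch_size` argument to `self.log`."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.layer(x), y)
        # Pass the batch size explicitly so epoch-level aggregation does not have to infer it (new in 1.4).
        self.log("train_loss", loss, on_epoch=True, batch_size=x.size(0))
        return loss

    def train_dataloader(self):
        # Random data, purely for illustration.
        x, y = torch.randn(64, 32), torch.randint(0, 2, (64,))
        return DataLoader(TensorDataset(x, y), batch_size=16)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# `reload_dataloaders_every_epoch` was replaced by `reload_dataloaders_every_n_epochs`:
# here the training dataloader is rebuilt every 2 epochs instead of every epoch.
trainer = pl.Trainer(max_epochs=4, reload_dataloaders_every_n_epochs=2)
trainer.fit(LitClassifier())
```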
Standard weekly patch release
[1.3.8] - 2021-07-01
Fixed
- Fixed a sync deadlock when checkpointing a `LightningModule` that uses a torchmetrics 0.4 `Metric` (#8218)
- Fixed compatibility with TorchMetrics v0.4 (#8206)
- Added torchelastic check when sanitizing GPUs (#8095)
- Fixed a DDP info message that was never shown (#8111)
- Fixed metrics deprecation message at module import level (#8163)
- Fixed a bug where an infinite recursion would be triggered when using the `BaseFinetuning` callback on a model that contains a `ModuleDict` (#8170)
- Added a mechanism to detect a `DDP` deadlock when only one process triggers an `Exception`; the processes are killed when this happens (#8167)
- Fixed NCCL error when selecting non-consecutive device ids (#8165)
- Fixed SWA to also work with `IterableDataset` (#8172)
Contributors
@GabrielePicco @SeanNaren @ethanwharris @carmocca @tchaton @justusschock
Hotfix Patch Release
[1.3.7post0] - 2021-06-23
Fixed
- Fixed backward compatibility of moved functions `rank_zero_warn` and `rank_zero_deprecation` (#8085)
Contributors
Standard weekly patch release
[1.3.7] - 2021-06-22
Fixed
- Fixed a bug where skipping an optimizer while using amp caused amp to trigger an assertion error (#7975)
- Fixed deprecation messages not showing due to incorrect stacklevel (#8002, #8005)
- Fixed setting a `DistributedSampler` when using a distributed plugin in a custom accelerator (#7814)
- Improved `PyTorchProfiler` chrome traces names (#8009)
- Fixed moving the best score to device in `EarlyStopping` callback for TPU devices (#7959)
Contributors
Standard weekly patch release
[1.3.6] - 2021-06-15
Fixed
- Fixed logs overwriting issue for remote filesystems (#7889)
- Fixed `DataModule.prepare_data` could only be called on the global rank 0 process (#7945)
- Fixed setting `worker_init_fn` to seed dataloaders correctly when using DDP (#7942)
- Fixed `BaseFinetuning` callback to properly handle parent modules w/ parameters (#7931)
Contributors
@awaelchli @Borda @kaushikb11 @Queuecumber @SeanNaren @senarvi @speediedan
Standard weekly patch release
[1.3.5] - 2021-06-08
Added
- Added warning to Training Step output (#7779)
Fixed
- Fixed LearningRateMonitor + BackboneFinetuning (#7835)
- Minor improvements to `apply_to_collection` and type signature of `log_dict` (#7851)
- Fixed docker versions (#7834)
- Fixed sharded training check for fp16 precision (#7825)
- Fixed support for torch Module type hints in LightningCLI (#7807)
Changed
- Move `training_output` validation to after `train_step_end` (#7868)
Contributors
@Borda, @justusschock, @kandluis, @mauvilsa, @shuyingsunshine21, @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.3.3] - 2021-05-26
Changed
- Changed calling of `untoggle_optimizer(opt_idx)` out of the closure function (#7563)
Fixed
- Fixed `ProgressBar` pickling after calling `trainer.predict` (#7608)
- Fixed broadcasting in multi-node, multi-gpu DDP using torch 1.7 (#7592)
- Fixed dataloaders not being reset when tuning the model (#7566)
- Fixed print errors in `ProgressBar` when `trainer.fit` is not called (#7674)
- Fixed global step update when the epoch is skipped (#7677)
- Fixed training loop total batch counter when accumulate grad batches was enabled (#7692)
Contributors
@carmocca @kaushikb11 @ryanking13 @Lucklyric @ajtritt @yifuwang
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.3.2] - 2021-05-18
Changed
- `DataModule`s now avoid duplicate `{setup,teardown,prepare_data}` calls for the same stage (#7238)
Fixed
- Fixed parsing of multiple training dataloaders (#7433)
- Fixed recursive passing of `wrong_type` keyword argument in `pytorch_lightning.utilities.apply_to_collection` (#7433)
- Fixed setting correct `DistribType` for `ddp_cpu` (spawn) backend (#7492)
- Fixed incorrect number of calls to LR scheduler when `check_val_every_n_epoch > 1` (#7032)
Contributors
@alanhdu @carmocca @justusschock @tkng
If we forgot someone due to not matching commit email with GitHub account, let us know :]