
Releases: Lightning-AI/pytorch-lightning

Standard weekly patch release

03 Aug 14:14

[1.4.1] - 2021-08-03

  • Fixed trainer.fit_loop.split_idx always returning None (#8601)
  • Fixed references for ResultCollection.extra (#8622)
  • Fixed reference issues during epoch end result collection (#8621)
  • Fixed Horovod auto-detection when Horovod is not installed and the launcher is mpirun (#8610)
  • Fixed an issue with training_step outputs not getting collected correctly for training_epoch_end (#8613)
  • Fixed distributed types support for CPUs (#8667)
  • Fixed a deadlock issue with DDP and torchelastic (#8655)
  • Fixed accelerator=ddp choice for CPU (#8645)

Contributors

@awaelchli, @Borda, @carmocca, @kaushikb11, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

TPU Pod Training, IPU Accelerator, DeepSpeed Infinity, Fully Sharded Data Parallel

27 Jul 15:30
c7f8c8c

Today we are excited to announce Lightning 1.4, introducing support for TPU pods, XLA profiling, IPUs, and new plugins to reach 10+ billion parameters, including DeepSpeed Infinity, Fully Sharded Data Parallel, and more!

https://devblog.pytorchlightning.ai/announcing-lightning-1-4-8cd20482aee9

[1.4.0] - 2021-07-27

Added

  • Added extract_batch_size utility and corresponding tests to extract batch dimension from multiple batch types (#8357)
  • Added support for named parameter groups in LearningRateMonitor (#7987)
  • Added dataclass support for pytorch_lightning.utilities.apply_to_collection (#7935)
  • Added support to LightningModule.to_torchscript for saving to custom filesystems with fsspec (#7617)
  • Added KubeflowEnvironment for use with the PyTorchJob operator in Kubeflow
  • Added LightningCLI support for config files on object stores (#7521)
  • Added ModelPruning(prune_on_train_epoch_end=True|False) to choose when to apply pruning (#7704)
  • Added support for checkpointing based on a provided time interval during training (#7515)
  • Progress tracking
    • Added dataclasses for progress tracking (#6603, #7574, #8140, #8362)
    • Add {,load_}state_dict to the progress tracking dataclasses (#8140)
    • Connect the progress tracking dataclasses to the loops (#8244, #8362)
    • Do not reset the progress tracking dataclasses total counters (#8475)
  • Added support for passing a LightningDataModule positionally as the second argument to trainer.{validate,test,predict} (#7431)
  • Added argument trainer.predict(ckpt_path) (#7430)
  • Added clip_grad_by_value support for TPUs (#7025)
  • Added support for passing any class to is_overridden (#7918)
  • Added sub_dir parameter to TensorBoardLogger (#6195)
  • Added correct dataloader_idx to batch transfer hooks (#6241)
  • Added include_none=bool argument to apply_to_collection (#7769)
  • Added apply_to_collections to apply a function to two zipped collections (#7769)
  • Added ddp_fully_sharded support (#7487)
  • Added should_rank_save_checkpoint property to Training Plugins (#7684)
  • Added log_grad_norm hook to LightningModule to customize the logging of gradient norms (#7873)
  • Added save_config_filename init argument to LightningCLI to ease resolving name conflicts (#7741)
  • Added save_config_overwrite init argument to LightningCLI to ease overwriting existing config files (#8059)
  • Added reset dataloader hooks to Training Plugins and Accelerators (#7861)
  • Added trainer stage hooks for Training Plugins and Accelerators (#7864)
  • Added the on_before_optimizer_step hook (#8048)
  • Added IPU Accelerator (#7867)
  • Fault-tolerant training
    • Added {,load_}state_dict to ResultCollection (#7948)
    • Added {,load_}state_dict to Loops (#8197)
    • Set Loop.restarting=False at the end of the first iteration (#8362)
    • Save the loops state with the checkpoint (opt-in) (#8362)
    • Save a checkpoint to restore the state on exception (opt-in) (#8362)
    • Added state_dict and load_state_dict utilities for CombinedLoader + utilities for dataloader (#8364)
  • Added rank_zero_only to LightningModule.log function (#7966)
  • Added metric_attribute to LightningModule.log function (#7966)
  • Added a warning if Trainer(log_every_n_steps) is a value too high for the training dataloader (#7734)
  • Added LightningCLI support for argument links applied on instantiation (#7895)
  • Added LightningCLI support for configurable callbacks that should always be present (#7964)
  • Added DeepSpeed Infinity Support, and updated to DeepSpeed 0.4.0 (#7234)
  • Added support for torch.nn.UninitializedParameter in ModelSummary (#7642)
  • Added support for LightningModule.save_hyperparameters when LightningModule is a dataclass (#7992)
  • Added support for overriding optimizer_zero_grad and optimizer_step when using accumulate_grad_batches (#7980)
  • Added logger boolean flag to save_hyperparameters (#7960)
  • Added support for calling scripts using the module syntax (python -m package.script) (#8073)
  • Added support for optimizers and learning rate schedulers to LightningCLI (#8093)
  • Added XLA Profiler (#8014)
  • Added PrecisionPlugin.{pre,post}_backward (#8328)
  • Added on_load_checkpoint and on_save_checkpoint hooks to the PrecisionPlugin base class (#7831)
  • Added max_depth parameter in ModelSummary (#8062)
  • Added XLAStatsMonitor callback (#8235)
  • Added restore function and restarting attribute to base Loop (#8247)
  • Added FastForwardSampler and CaptureIterableDataset (#8307)
  • Added support for save_hyperparameters in LightningDataModule (#3792)
  • Added ModelCheckpoint(save_on_train_epoch_end) to choose when to run the saving logic (#8389)
  • Added LSFEnvironment for distributed training with the LSF resource manager jsrun (#5102)
  • Added support for accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto' (#7808)
  • Added tpu_spawn_debug to plugin registry (#7933)
  • Enabled traditional/manual launching of DDP processes through LOCAL_RANK and NODE_RANK environment variable assignments (#7480)
  • Added quantize_on_fit_end argument to QuantizationAwareTraining (#8464)
  • Added experimental support for loop specialization (#8226)
  • Added support for devices flag to Trainer (#8440)
  • Added private prevent_trainer_and_dataloaders_deepcopy context manager on the LightningModule (#8472)
  • Added support for providing callables to the Lightning CLI instead of types (#8400)
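Several of the additions above extend the apply_to_collection utility (dataclass support in #7935, the include_none argument and the zipped apply_to_collections in #7769). A minimal, self-contained sketch of the core idea — not Lightning's actual implementation, which also handles namedtuples and wrong-type callbacks:

```python
from dataclasses import is_dataclass, fields, replace

def apply_to_collection(data, dtype, fn):
    """Recursively apply `fn` to every element of `data` matching `dtype`.

    Simplified sketch of the pytorch_lightning.utilities helper: dicts,
    lists, tuples, and (per #7935) dataclasses are traversed; anything
    else that is not a `dtype` instance is returned unchanged.
    """
    if isinstance(data, dtype):
        return fn(data)
    if isinstance(data, dict):
        return {k: apply_to_collection(v, dtype, fn) for k, v in data.items()}
    if isinstance(data, (list, tuple)):
        return type(data)(apply_to_collection(v, dtype, fn) for v in data)
    if is_dataclass(data):
        # Dataclass support: rebuild the instance with transformed fields.
        return replace(data, **{
            f.name: apply_to_collection(getattr(data, f.name), dtype, fn)
            for f in fields(data)
        })
    return data
```

For example, `apply_to_collection({"a": 1, "b": [2, 3]}, int, lambda v: v * 2)` yields `{"a": 2, "b": [4, 6]}`, leaving non-matching leaves untouched.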

Changed

  • Decoupled device parsing logic from Accelerator connector to Trainer (#8180)
  • Changed the Trainer's checkpoint_callback argument to allow only boolean values (#7539)
  • Log epoch metrics before the on_evaluation_end hook (#7272)
  • Explicitly disallow calling self.log(on_epoch=False) during epoch-only or single-call hooks (#7874)
  • Changed these Trainer methods to be protected: call_setup_hook, call_configure_sharded_model, pre_dispatch, dispatch, post_dispatch, call_teardown_hook, run_train, run_sanity_check, run_evaluate, run_evaluation, run_predict, track_output_for_epoch_end
  • Changed metrics_to_scalars to work with any collection or value (#7888)
  • Changed clip_grad_norm to use torch.nn.utils.clip_grad_norm_ (#7025)
  • Validation is now always run inside the training epoch scope (#7357)
  • ModelCheckpoint now runs at the end of the training epoch by default (#8389)
  • EarlyStopping now runs at the end of the training epoch by default (#8286)
  • Refactored Loops
    • Moved attributes global_step, current_epoch, max/min_steps, max/min_epochs, batch_idx, and total_batch_idx to TrainLoop (#7437)
    • Refactored result handling in training loop (#7506)
    • Moved attributes hiddens and split_idx to TrainLoop (#7507)
    • Refactored the logic around manual and automatic optimization inside the optimizer loop (#7526)
    • Simplified "should run validation" logic (#7682)
    • Simplified logic for updating the learning rate for schedulers (#7682)
    • Removed the on_epoch guard from the "should stop" validation check (#7701)
    • Refactored internal loop interface; added new classes FitLoop, TrainingEpochLoop, TrainingBatchLoop (#7871, #8077)
    • Removed pytorch_lightning/trainer/training_loop.py (#7985)
    • Refactored evaluation loop interface; added new classes DataLoaderLoop, EvaluationLoop, EvaluationEpochLoop (#7990, #8077)
    • Removed pytorch_lightning/trainer/evaluation_loop.py (#8056)
    • Restricted public access to several internal functions (#8024)
    • Refactored trainer _run_* functions and separate evaluation loops (#8065)
    • Refactored prediction loop interface; added new classes PredictionLoop, PredictionEpochLoop (#7700, #8077)
    • Removed pytorch_lightning/trainer/predict_loop.py (#8094)
    • Moved result teardown to the loops (#8245)
    • Improve Loop API to better handle children state_dict and progress (#8334)
  • Refactored logging
    • Renamed and moved core/step_result.py to trainer/connectors/logger_connector/result.py (#7736)
    • Dramatically simplify the LoggerConnector (#7882)
    • trainer.{logged,progress_bar,callback}_metrics are now updated on-demand (#7882)
    • Completely overhaul the Result object in favor of ResultMetric (#7882)
    • Improve epoch-level reduction time and overall memory usage (#7882)
    • Allow passing self.log(batch_size=...) (#7891)
    • Each of the training loops now keeps its own results collection (#7891)
    • Remove EpochResultStore and HookResultStore in favor of ResultCollection (#7909)
    • Remove MetricsHolder (#7909)
  • Moved ignore_scalar_return_in_dp warning suppression to the DataParallelPlugin class (#7421)
  • Changed the behaviour when logging evaluation step metrics to no longer append /epoch_* to the metric name (#7351)
  • Raised ValueError when a None value is self.log-ed (#7771)
  • Changed resolve_training_type_plugins to allow setting num_nodes and sync_batchnorm from Trainer setting (#7026)
  • Default seed_everything(workers=True) in the LightningCLI (#7504)
  • Changed model.state_dict() in CheckpointConnector to allow training_type_plugin to customize the model's state_dict() (#7474)
  • MLflowLogger now uses the env variable MLFLOW_TRACKING_URI as default tracking URI (#7457)
  • Changed Trainer arg and functionality from reload_dataloaders_every_epoch to reload_dataloaders_every_n_epochs (#5043)
  • Changed WandbLogger(log_model={True/'all'}) to log models as artifacts (#6231)
  • MLFlowLogger now accepts run_name as a constructor argument (#7622)
  • Changed teardown() in Accelerator to allow training_type_plugin to customize teardown logic (#7579)
  • Trainer.fit now raises an error when using manual optimization with unsupp...

Standard weekly patch release

01 Jul 13:55

[1.3.8] - 2021-07-01

Fixed

  • Fixed a sync deadlock when checkpointing a LightningModule that uses a torchmetrics 0.4 Metric (#8218)
  • Fixed compatibility with TorchMetrics v0.4 (#8206)
  • Added torchelastic check when sanitizing GPUs (#8095)
  • Fixed a DDP info message that was never shown (#8111)
  • Fixed metrics deprecation message at module import level (#8163)
  • Fixed a bug where an infinite recursion would be triggered when using the BaseFinetuning callback on a model that contains a ModuleDict (#8170)
  • Added a mechanism to detect a DDP deadlock when only one process triggers an exception; the mechanism kills the processes when this happens (#8167)
  • Fixed NCCL error when selecting non-consecutive device ids (#8165)
  • Fixed SWA to also work with IterableDataset (#8172)

Contributors

@GabrielePicco @SeanNaren @ethanwharris @carmocca @tchaton @justusschock

Hotfix Patch Release

23 Jun 13:03

[1.3.7post0] - 2021-06-23

Fixed

  • Fixed backward compatibility of moved functions rank_zero_warn and rank_zero_deprecation (#8085)

Contributors

@kaushikb11 @carmocca

Standard weekly patch release

22 Jun 14:08

[1.3.7] - 2021-06-22

Fixed

  • Fixed a bug where skipping an optimizer while using amp caused amp to trigger an assertion error (#7975)
  • Fixed deprecation messages not showing due to incorrect stacklevel (#8002, #8005)
  • Fixed setting a DistributedSampler when using a distributed plugin in a custom accelerator (#7814)
  • Improved PyTorchProfiler chrome traces names (#8009)
  • Fixed moving the best score to device in EarlyStopping callback for TPU devices (#7959)

Contributors

@yifuwang @kaushikb11 @ajtritt @carmocca @tchaton

Standard weekly patch release

17 Jun 16:15

[1.3.6] - 2021-06-15

Fixed

  • Fixed logs overwriting issue for remote filesystems (#7889)
  • Fixed DataModule.prepare_data to only be called on the global rank 0 process (#7945)
  • Fixed setting worker_init_fn to seed dataloaders correctly when using DDP (#7942)
  • Fixed BaseFinetuning callback to properly handle parent modules w/ parameters (#7931)
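The worker_init_fn fix (#7942) concerns deriving a distinct, deterministic seed for each dataloader worker on each DDP rank, so that no two workers share a random stream. A hedged sketch of the seed derivation — the names here are illustrative, not Lightning's internal API, and the real implementation also seeds the numpy and torch generators:

```python
import random

def worker_init_fn_sketch(worker_id, rank=0, num_workers=1, base_seed=42):
    """Derive a unique, deterministic seed per (rank, worker) pair.

    Illustrative only: flattens the (rank, worker_id) grid into a single
    global worker index and offsets the base seed by it, guaranteeing
    distinct seeds across all workers of all DDP processes.
    """
    global_worker_id = rank * num_workers + worker_id
    seed = base_seed + global_worker_id
    random.seed(seed)
    return seed
```

With `num_workers=2`, workers (rank 0, id 0), (rank 0, id 1), and (rank 1, id 0) receive seeds 42, 43, and 44 respectively — deterministic across runs yet unique per worker.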

Contributors

@awaelchli @Borda @kaushikb11 @Queuecumber @SeanNaren @senarvi @speediedan

Standard weekly patch release

09 Jun 08:53

[1.3.5] - 2021-06-08

Added

  • Added warning to Training Step output (#7779)

Fixed

  • Fixed LearningRateMonitor + BackboneFinetuning (#7835)
  • Minor improvements to apply_to_collection and type signature of log_dict (#7851)
  • Fixed docker versions (#7834)
  • Fixed sharded training check for fp16 precision (#7825)
  • Fixed support for torch Module type hints in LightningCLI (#7807)

Changed

  • Move training_output validation to after train_step_end (#7868)

Contributors

@Borda, @justusschock, @kandluis, @mauvilsa, @shuyingsunshine21, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

03 Jun 14:57

[1.3.4] - 2021-06-01

Fixed

  • Fixed info message when max training time reached (#7780)
  • Fixed missing __len__ method in IndexBatchSamplerWrapper (#7681)

Contributors

@awaelchli @kaushikb11

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

26 May 14:59

[1.3.3] - 2021-05-26

Changed

  • Moved the untoggle_optimizer(opt_idx) call out of the closure function (#7563)

Fixed

  • Fixed ProgressBar pickling after calling trainer.predict (#7608)
  • Fixed broadcasting in multi-node, multi-gpu DDP using torch 1.7 (#7592)
  • Fixed dataloaders not being reset when tuning the model (#7566)
  • Fixed print errors in ProgressBar when trainer.fit is not called (#7674)
  • Fixed global step update when the epoch is skipped (#7677)
  • Fixed the training loop total batch counter when accumulate_grad_batches was enabled (#7692)
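The total-batch-counter fix with gradient accumulation (#7692) comes down to counting optimizer steps separately from forward batches. A small sketch of the bookkeeping, under the assumption that a step fires every `accumulate_grad_batches` batches and that a trailing partial accumulation window at the epoch end also triggers a step (the function name is illustrative, not Lightning's API):

```python
def count_optimizer_steps(num_batches, accumulate_grad_batches):
    """Count optimizer steps in one epoch under gradient accumulation.

    A step fires after every `accumulate_grad_batches` batches; a
    leftover partial window at the epoch boundary adds one final step.
    """
    full_steps, remainder = divmod(num_batches, accumulate_grad_batches)
    return full_steps + (1 if remainder else 0)
```

For example, 10 batches with `accumulate_grad_batches=4` yields 3 optimizer steps (two full windows of 4 plus a trailing window of 2), which is why the batch counter and the step counter must be tracked independently.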

Contributors

@carmocca @kaushikb11 @ryanking13 @Lucklyric @ajtritt @yifuwang

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

19 May 20:23

[1.3.2] - 2021-05-18

Changed

  • DataModules now avoid duplicate {setup,teardown,prepare_data} calls for the same stage (#7238)

Fixed

  • Fixed parsing of multiple training dataloaders (#7433)
  • Fixed recursive passing of wrong_type keyword argument in pytorch_lightning.utilities.apply_to_collection (#7433)
  • Fixed setting correct DistribType for ddp_cpu (spawn) backend (#7492)
  • Fixed incorrect number of calls to LR scheduler when check_val_every_n_epoch > 1 (#7032)

Contributors

@alanhdu @carmocca @justusschock @tkng

If we forgot someone due to not matching commit email with GitHub account, let us know :]