
Releases: Lightning-AI/pytorch-lightning

Standard weekly patch release

03 Aug 14:14

[1.4.1] - 2021-08-03

  • Fixed trainer.fit_loop.split_idx always returning None (#8601)
  • Fixed references for ResultCollection.extra (#8622)
  • Fixed reference issues during epoch end result collection (#8621)
  • Fixed Horovod auto-detection when Horovod is not installed and the launcher is mpirun (#8610)
  • Fixed an issue with training_step outputs not getting collected correctly for training_epoch_end (#8613)
  • Fixed distributed types support for CPUs (#8667)
  • Fixed a deadlock issue with DDP and torchelastic (#8655)
  • Fixed accelerator=ddp choice for CPU (#8645)

Contributors

@awaelchli, @Borda, @carmocca, @kaushikb11, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

TPU Pod Training, IPU Accelerator, DeepSpeed Infinity, Fully Sharded Data Parallel

27 Jul 15:30
c7f8c8c

Today we are excited to announce Lightning 1.4, introducing support for TPU pods, XLA profiling, IPUs, and new plugins to reach 10+ billion parameters, including DeepSpeed Infinity, Fully Sharded Data Parallel, and more!

https://devblog.pytorchlightning.ai/announcing-lightning-1-4-8cd20482aee9

[1.4.0] - 2021-07-27

Added

  • Added extract_batch_size utility and corresponding tests to extract batch dimension from multiple batch types (#8357)
  • Added support for named parameter groups in LearningRateMonitor (#7987)
  • Added dataclass support for pytorch_lightning.utilities.apply_to_collection (#7935)
  • Added support to LightningModule.to_torchscript for saving to custom filesystems with fsspec (#7617)
  • Added KubeflowEnvironment for use with the PyTorchJob operator in Kubeflow
  • Added LightningCLI support for config files on object stores (#7521)
  • Added ModelPruning(prune_on_train_epoch_end=True|False) to choose when to apply pruning (#7704)
  • Added support for checkpointing based on a provided time interval during training (#7515)
  • Progress tracking
    • Added dataclasses for progress tracking (#6603, #7574, #8140, #8362)
    • Add {,load_}state_dict to the progress tracking dataclasses (#8140)
    • Connect the progress tracking dataclasses to the loops (#8244, #8362)
    • Do not reset the progress tracking dataclasses total counters (#8475)
  • Added support for passing a LightningDataModule positionally as the second argument to trainer.{validate,test,predict} (#7431)
  • Added argument trainer.predict(ckpt_path) (#7430)
  • Added clip_grad_by_value support for TPUs (#7025)
  • Added support for passing any class to is_overridden (#7918)
  • Added sub_dir parameter to TensorBoardLogger (#6195)
  • Added correct dataloader_idx to batch transfer hooks (#6241)
  • Added include_none=bool argument to apply_to_collection (#7769)
  • Added apply_to_collections to apply a function to two zipped collections (#7769)
  • Added ddp_fully_sharded support (#7487)
  • Added should_rank_save_checkpoint property to Training Plugins (#7684)
  • Added log_grad_norm hook to LightningModule to customize the logging of gradient norms (#7873)
  • Added save_config_filename init argument to LightningCLI to ease resolving name conflicts (#7741)
  • Added save_config_overwrite init argument to LightningCLI to ease overwriting existing config files (#8059)
  • Added reset dataloader hooks to Training Plugins and Accelerators (#7861)
  • Added trainer stage hooks for Training Plugins and Accelerators (#7864)
  • Added the on_before_optimizer_step hook (#8048)
  • Added IPU Accelerator (#7867)
  • Fault-tolerant training
    • Added {,load_}state_dict to ResultCollection (#7948)
    • Added {,load_}state_dict to Loops (#8197)
    • Set Loop.restarting=False at the end of the first iteration (#8362)
    • Save the loops state with the checkpoint (opt-in) (#8362)
    • Save a checkpoint to restore the state on exception (opt-in) (#8362)
    • Added state_dict and load_state_dict utilities for CombinedLoader + utilities for dataloader (#8364)
  • Added rank_zero_only to LightningModule.log function (#7966)
  • Added metric_attribute to LightningModule.log function (#7966)
  • Added a warning if Trainer(log_every_n_steps) is a value too high for the training dataloader (#7734)
  • Added LightningCLI support for argument links applied on instantiation (#7895)
  • Added LightningCLI support for configurable callbacks that should always be present (#7964)
  • Added DeepSpeed Infinity Support, and updated to DeepSpeed 0.4.0 (#7234)
  • Added support for torch.nn.UninitializedParameter in ModelSummary (#7642)
  • Added support for LightningModule.save_hyperparameters when LightningModule is a dataclass (#7992)
  • Added support for overriding optimizer_zero_grad and optimizer_step when using accumulate_grad_batches (#7980)
  • Added logger boolean flag to save_hyperparameters (#7960)
  • Added support for calling scripts using the module syntax (python -m package.script) (#8073)
  • Added support for optimizers and learning rate schedulers to LightningCLI (#8093)
  • Added XLA Profiler (#8014)
  • Added PrecisionPlugin.{pre,post}_backward (#8328)
  • Added on_load_checkpoint and on_save_checkpoint hooks to the PrecisionPlugin base class (#7831)
  • Added max_depth parameter in ModelSummary (#8062)
  • Added XLAStatsMonitor callback (#8235)
  • Added restore function and restarting attribute to base Loop (#8247)
  • Added FastForwardSampler and CaptureIterableDataset (#8307)
  • Added support for save_hyperparameters in LightningDataModule (#3792)
  • Added ModelCheckpoint(save_on_train_epoch_end) to choose when to run the saving logic (#8389)
  • Added LSFEnvironment for distributed training with the LSF resource manager jsrun (#5102)
  • Added support for accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto' (#7808)
  • Added tpu_spawn_debug to plugin registry (#7933)
  • Enabled traditional/manual launching of DDP processes through LOCAL_RANK and NODE_RANK environment variable assignments (#7480)
  • Added quantize_on_fit_end argument to QuantizationAwareTraining (#8464)
  • Added experimental support for loop specialization (#8226)
  • Added support for devices flag to Trainer (#8440)
  • Added private prevent_trainer_and_dataloaders_deepcopy context manager on the LightningModule (#8472)
  • Added support for providing callables to the Lightning CLI instead of types (#8400)
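Several of the additions above extend the apply_to_collection utility (dataclass support in #7935, the include_none argument and the zipped apply_to_collections in #7769). A minimal, self-contained sketch of the core idea — not Lightning's actual implementation, which also handles namedtuples and wrong-type callbacks:

```python
from dataclasses import is_dataclass, fields, replace

def apply_to_collection(data, dtype, fn):
    """Recursively apply `fn` to every element of `data` matching `dtype`.

    Simplified sketch of the pytorch_lightning.utilities helper: dicts,
    lists, tuples, and (per #7935) dataclasses are traversed; anything
    else that is not a `dtype` instance is returned unchanged.
    """
    if isinstance(data, dtype):
        return fn(data)
    if isinstance(data, dict):
        return {k: apply_to_collection(v, dtype, fn) for k, v in data.items()}
    if isinstance(data, (list, tuple)):
        return type(data)(apply_to_collection(v, dtype, fn) for v in data)
    if is_dataclass(data):
        # Dataclass support: rebuild the instance with transformed fields.
        return replace(data, **{
            f.name: apply_to_collection(getattr(data, f.name), dtype, fn)
            for f in fields(data)
        })
    return data
```

For example, `apply_to_collection({"a": 1, "b": [2, 3]}, int, lambda v: v * 2)` yields `{"a": 2, "b": [4, 6]}`, leaving non-matching leaves untouched.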

Changed

  • Decoupled device parsing logic from Accelerator connector to Trainer (#8180)
  • Changed the Trainer's checkpoint_callback argument to allow only boolean values (#7539)
  • Log epoch metrics before the on_evaluation_end hook (#7272)
  • Explicitly disallow calling self.log(on_epoch=False) during epoch-only or single-call hooks (#7874)
  • Changed these Trainer methods to be protected: call_setup_hook, call_configure_sharded_model, pre_dispatch, dispatch, post_dispatch, call_teardown_hook, run_train, run_sanity_check, run_evaluate, run_evaluation, run_predict, track_output_for_epoch_end
  • Changed metrics_to_scalars to work with any collection or value (#7888)
  • Changed clip_grad_norm to use torch.nn.utils.clip_grad_norm_ (#7025)
  • Validation is now always run inside the training epoch scope (#7357)
  • ModelCheckpoint now runs at the end of the training epoch by default (#8389)
  • EarlyStopping now runs at the end of the training epoch by default (#8286)
  • Refactored Loops
    • Moved attributes global_step, current_epoch, max/min_steps, max/min_epochs, batch_idx, and total_batch_idx to TrainLoop (#7437)
    • Refactored result handling in training loop (#7506)
    • Moved attributes hiddens and split_idx to TrainLoop (#7507)
    • Refactored the logic around manual and automatic optimization inside the optimizer loop (#7526)
    • Simplified "should run validation" logic (#7682)
    • Simplified logic for updating the learning rate for schedulers (#7682)
    • Removed the on_epoch guard from the "should stop" validation check (#7701)
    • Refactored internal loop interface; added new classes FitLoop, TrainingEpochLoop, TrainingBatchLoop (#7871, #8077)
    • Removed pytorch_lightning/trainer/training_loop.py (#7985)
    • Refactored evaluation loop interface; added new classes DataLoaderLoop, EvaluationLoop, EvaluationEpochLoop (#7990, #8077)
    • Removed pytorch_lightning/trainer/evaluation_loop.py (#8056)
    • Restricted public access to several internal functions (#8024)
    • Refactored trainer _run_* functions and separate evaluation loops (#8065)
    • Refactored prediction loop interface; added new classes PredictionLoop, PredictionEpochLoop (#7700, #8077)
    • Removed pytorch_lightning/trainer/predict_loop.py (#8094)
    • Moved result teardown to the loops (#8245)
    • Improve Loop API to better handle children state_dict and progress (#8334)
  • Refactored logging
    • Renamed and moved core/step_result.py to trainer/connectors/logger_connector/result.py (#7736)
    • Dramatically simplify the LoggerConnector (#7882)
    • trainer.{logged,progress_bar,callback}_metrics are now updated on-demand (#7882)
    • Completely overhaul the Result object in favor of ResultMetric (#7882)
    • Improve epoch-level reduction time and overall memory usage (#7882)
    • Allow passing self.log(batch_size=...) (#7891)
    • Each of the training loops now keeps its own results collection (#7891)
    • Remove EpochResultStore and HookResultStore in favor of ResultCollection (#7909)
    • Remove MetricsHolder (#7909)
  • Moved ignore_scalar_return_in_dp warning suppression to the DataParallelPlugin class (#7421)
  • Changed the behaviour when logging evaluation step metrics to no longer append /epoch_* to the metric name (#7351)
  • Raised ValueError when a None value is self.log-ed (#7771)
  • Changed resolve_training_type_plugins to allow setting num_nodes and sync_batchnorm from Trainer setting (#7026)
  • Default seed_everything(workers=True) in the LightningCLI (#7504)
  • Changed model.state_dict() in CheckpointConnector to allow training_type_plugin to customize the model's state_dict() (#7474)
  • MLflowLogger now uses the env variable MLFLOW_TRACKING_URI as default tracking URI (#7457)
  • Changed Trainer arg and functionality from reload_dataloaders_every_epoch to reload_dataloaders_every_n_epochs (#5043)
  • Changed WandbLogger(log_model={True/'all'}) to log models as artifacts (#6231)
  • MLFlowLogger now accepts run_name as a constructor argument (#7622)
  • Changed teardown() in Accelerator to allow training_type_plugin to customize teardown logic (#7579)
  • Trainer.fit now raises an error when using manual optimization with unsupp...

Standard weekly patch release

01 Jul 13:55

[1.3.8] - 2021-07-01

Fixed

  • Fixed a sync deadlock when checkpointing a LightningModule that uses a torchmetrics 0.4 Metric (#8218)
  • Fixed compatibility with TorchMetrics v0.4 (#8206)
  • Added torchelastic check when sanitizing GPUs (#8095)
  • Fixed a DDP info message that was never shown (#8111)
  • Fixed metrics deprecation message at module import level (#8163)
  • Fixed a bug where an infinite recursion would be triggered when using the BaseFinetuning callback on a model that contains a ModuleDict (#8170)
  • Added a mechanism to detect a DDP deadlock when only one process triggers an exception; the mechanism kills the processes when this happens (#8167)
  • Fixed NCCL error when selecting non-consecutive device ids (#8165)
  • Fixed SWA to also work with IterableDataset (#8172)

Contributors

@GabrielePicco @SeanNaren @ethanwharris @carmocca @tchaton @justusschock

Hotfix Patch Release

23 Jun 13:03

[1.3.7post0] - 2021-06-23

Fixed

  • Fixed backward compatibility of moved functions rank_zero_warn and rank_zero_deprecation (#8085)

Contributors

@kaushikb11 @carmocca

Standard weekly patch release

22 Jun 14:08

[1.3.7] - 2021-06-22

Fixed

  • Fixed a bug where skipping an optimizer while using amp caused amp to trigger an assertion error (#7975)
  • Fixed deprecation messages not showing due to incorrect stacklevel (#8002, #8005)
  • Fixed setting a DistributedSampler when using a distributed plugin in a custom accelerator (#7814)
  • Improved PyTorchProfiler chrome traces names (#8009)
  • Fixed moving the best score to device in EarlyStopping callback for TPU devices (#7959)

Contributors

@yifuwang @kaushikb11 @ajtritt @carmocca @tchaton

Standard weekly patch release

17 Jun 16:15

[1.3.6] - 2021-06-15

Fixed

  • Fixed logs overwriting issue for remote filesystems (#7889)
  • Fixed DataModule.prepare_data to only be called on the global rank 0 process (#7945)
  • Fixed setting worker_init_fn to seed dataloaders correctly when using DDP (#7942)
  • Fixed BaseFinetuning callback to properly handle parent modules w/ parameters (#7931)
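The worker_init_fn fix (#7942) concerns deriving a distinct, deterministic seed for each dataloader worker on each DDP rank, so that no two workers share a random stream. A hedged sketch of the seed derivation — the names here are illustrative, not Lightning's internal API, and the real implementation also seeds the numpy and torch generators:

```python
import random

def worker_init_fn_sketch(worker_id, rank=0, num_workers=1, base_seed=42):
    """Derive a unique, deterministic seed per (rank, worker) pair.

    Illustrative only: flattens the (rank, worker_id) grid into a single
    global worker index and offsets the base seed by it, guaranteeing
    distinct seeds across all workers of all DDP processes.
    """
    global_worker_id = rank * num_workers + worker_id
    seed = base_seed + global_worker_id
    random.seed(seed)
    return seed
```

With `num_workers=2`, workers (rank 0, id 0), (rank 0, id 1), and (rank 1, id 0) receive seeds 42, 43, and 44 respectively — deterministic across runs yet unique per worker.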

Contributors

@awaelchli @Borda @kaushikb11 @Queuecumber @SeanNaren @senarvi @speediedan

Standard weekly patch release

09 Jun 08:53

[1.3.5] - 2021-06-08

Added

  • Added warning to Training Step output (#7779)

Fixed

  • Fixed LearningRateMonitor + BackboneFinetuning (#7835)
  • Minor improvements to apply_to_collection and type signature of log_dict (#7851)
  • Fixed docker versions (#7834)
  • Fixed sharded training check for fp16 precision (#7825)
  • Fixed support for torch Module type hints in LightningCLI (#7807)

Changed

  • Move training_output validation to after train_step_end (#7868)

Contributors

@Borda, @justusschock, @kandluis, @mauvilsa, @shuyingsunshine21, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

03 Jun 14:57

[1.3.4] - 2021-06-01

Fixed

  • Fixed info message when max training time reached (#7780)
  • Fixed missing __len__ method in IndexBatchSamplerWrapper (#7681)

Contributors

@awaelchli @kaushikb11

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

26 May 14:59

[1.3.3] - 2021-05-26

Changed

  • Moved the untoggle_optimizer(opt_idx) call out of the closure function (#7563)

Fixed

  • Fixed ProgressBar pickling after calling trainer.predict (#7608)
  • Fixed broadcasting in multi-node, multi-gpu DDP using torch 1.7 (#7592)
  • Fixed dataloaders not being reset when tuning the model (#7566)
  • Fixed print errors in ProgressBar when trainer.fit is not called (#7674)
  • Fixed global step update when the epoch is skipped (#7677)
  • Fixed the training loop total batch counter when accumulate_grad_batches was enabled (#7692)
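The total-batch-counter fix with gradient accumulation (#7692) comes down to counting optimizer steps separately from forward batches. A small sketch of the bookkeeping, under the assumption that a step fires every `accumulate_grad_batches` batches and that a trailing partial accumulation window at the epoch end also triggers a step (the function name is illustrative, not Lightning's API):

```python
def count_optimizer_steps(num_batches, accumulate_grad_batches):
    """Count optimizer steps in one epoch under gradient accumulation.

    A step fires after every `accumulate_grad_batches` batches; a
    leftover partial window at the epoch boundary adds one final step.
    """
    full_steps, remainder = divmod(num_batches, accumulate_grad_batches)
    return full_steps + (1 if remainder else 0)
```

For example, 10 batches with `accumulate_grad_batches=4` yields 3 optimizer steps (two full windows of 4 plus a trailing window of 2), which is why the batch counter and the step counter must be tracked independently.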

Contributors

@carmocca @kaushikb11 @ryanking13 @Lucklyric @ajtritt @yifuwang

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

19 May 20:23

[1.3.2] - 2021-05-18

Changed

  • DataModules now avoid duplicate {setup,teardown,prepare_data} calls for the same stage (#7238)

Fixed

  • Fixed parsing of multiple training dataloaders (#7433)
  • Fixed recursive passing of wrong_type keyword argument in pytorch_lightning.utilities.apply_to_collection (#7433)
  • Fixed setting correct DistribType for ddp_cpu (spawn) backend (#7492)
  • Fixed incorrect number of calls to LR scheduler when check_val_every_n_epoch > 1 (#7032)

Contributors

@alanhdu @carmocca @justusschock @tkng

If we forgot someone due to not matching commit email with GitHub account, let us know :]