Releases: Lightning-AI/pytorch-lightning

1.0.0 - General availability

13 Oct 12:12
09c2020

Overview

...

Detail changes

  • Added Explained Variance Metric + metric fix (#4013)
  • Added Metric <-> Lightning Module integration tests (#4008)
  • Added parsing OS env vars in Trainer (#4022)
  • Added classification metrics (#4043)
  • Updated explained variance metric (#4024)
  • Enabled plugins (#4041)
  • Enabled custom clusters (#4048)
  • Enabled passing in custom accelerators (#4050)
  • Added LightningModule.toggle_optimizer (#4058)
  • Added LightningModule.manual_backward (#4063)

Changed

Removed

  • Removed output argument from *_batch_end hooks (#3965, #3966)
  • Removed output argument from *_epoch_end hooks (#3967)
  • Removed support for EvalResult and TrainResult (#3968)
  • Removed deprecated trainer flags: overfit_pct, log_save_interval, row_log_interval (#3969)
  • Removed deprecated early_stop_callback (#3982)
  • Removed deprecated model hooks (#3980)
  • Removed deprecated callbacks (#3979)
  • Removed trainer argument in LightningModule.backward (#4056)

Fixed

  • Fixed current_epoch property update to reflect true epoch number inside LightningDataModule, when reload_dataloaders_every_epoch=True. (#3974)
  • Fixed to print scaler value in progress bar (#4053)
  • Fixed mismatch between docstring and code regarding when on_load_checkpoint hook is called (#3996)

Contributors

@ananyahjha93, @Borda, @edenlightning, @hbredin, @rohitgr7, @SkafteNicki, @teddykoker, @williamFalcon

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Buffer release before 1.0

07 Oct 21:16
b4051e7

This release is a buffer in case 1.0 breaks any compatibility for people who upgrade. 0.10.0 has all the bug fixes and features of 1.0 but is 100% backward compatible. The 1.0 release will follow within the next 24 hours.

Overview

The major changes are:

  • Results objects are deprecated (we hated them too haha)
  • This means dataflow and logging have been decoupled

To log:

def any_step(self, ...):
    self.log('something', i_computed)

Separately, return whatever you want from methods:

def training_step(self, ...):
    return loss

or

def training_step(self, ...):
    return {'loss': loss, 'whatever': [1, 'want']}
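
The decoupling described above can be sketched without Lightning installed. In this minimal sketch, `MiniModule` and its `logged` dict are hypothetical stand-ins for a LightningModule and its logger, and the toy loss is made up for illustration; only the `self.log(name, value)` call shape and the two return styles come from the notes above.

```python
# Hypothetical stand-in for a LightningModule: it only records what is logged.
class MiniModule:
    def __init__(self):
        self.logged = {}  # assumption: a plain dict plays the role of the logger

    def log(self, name, value):
        # Logging is a side effect, independent of what the step returns.
        self.logged[name] = value

    def training_step(self, batch):
        loss = sum(batch) / len(batch)  # toy "loss" for illustration only
        self.log('train_loss', loss)
        # Return either the loss itself or a dict that contains 'loss'.
        return {'loss': loss, 'whatever': [1, 'want']}


module = MiniModule()
output = module.training_step([1.0, 2.0, 3.0])
```

The logged value and the returned value travel separately: changing what the step returns does not change what ends up in `module.logged`.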

Detail changes

Added

  • Added new Metrics API. (#3868, #3921)
  • Enable PyTorch 1.7 compatibility (#3541)
  • Added LightningModule.to_torchscript to support exporting as ScriptModule (#3258)
  • Added warning when dropping unpicklable hparams (#2874)
  • Added EMB similarity (#3349)
  • Added ModelCheckpoint.to_yaml method (#3048)
  • Allow ModelCheckpoint monitor to be None, meaning it will always save (#3630)
  • Disabled optimizers setup during testing (#3059)
  • Added support for datamodules to save and load checkpoints when training (#3563)
  • Added support for datamodule in learning rate finder (#3425)
  • Added gradient clip test for native AMP (#3754)
  • Added dist lib to enable syncing anything across devices (#3762)
  • Added broadcast to TPUBackend (#3814)
  • Added XLADeviceUtils class to check XLA device type (#3274)

Changed

  • Refactored accelerator backends:
    • moved TPU xxx_step to backend (#3118)
    • refactored DDP backend forward (#3119)
    • refactored GPU backend __step (#3120)
    • refactored Horovod backend (#3121, #3122)
    • remove obscure forward call in eval + CPU backend ___step (#3123)
    • reduced all simplified forward (#3126)
    • added hook base method (#3127)
    • refactor eval loop to use hooks - use test_mode for if so we can split later (#3129)
    • moved ___step_end hooks (#3130)
    • training forward refactor (#3134)
    • training AMP scaling refactor (#3135)
    • eval step scaling factor (#3136)
    • add eval loop object to streamline eval loop (#3138)
    • refactored dataloader process hook (#3139)
    • refactored inner eval loop (#3141)
    • final inner eval loop hooks (#3154)
    • clean up hooks in run_evaluation (#3156)
    • clean up data reset (#3161)
    • expand eval loop out (#3165)
    • moved hooks around in eval loop (#3195)
    • remove _evaluate fx (#3197)
    • Trainer.fit hook clean up (#3198)
    • DDPs train hooks (#3203)
    • refactor DDP backend (#3204, #3207, #3208, #3209, #3210)
    • reduced accelerator selection (#3211)
    • group prepare data hook (#3212)
    • added data connector (#3285)
    • modular is_overridden (#3290)
    • adding Trainer.tune() (#3293)
    • move run_pretrain_routine -> setup_training (#3294)
    • move train outside of setup training (#3297)
    • move prepare_data to data connector (#3307)
    • moved accelerator router (#3309)
    • train loop refactor - moving train loop to own object (#3310, #3312, #3313, #3314)
    • duplicate data interface definition up into DataHooks class (#3344)
    • inner train loop (#3359, #3361, #3362, #3363, #3365, #3366, #3367, #3368, #3369, #3370, #3371, #3372, #3373, #3374, #3375, #3376, #3385, #3388, #3397)
    • all logging related calls in a connector (#3395)
    • device parser (#3400, #3405)
    • added model connector (#3407)
    • moved eval loop logging to loggers (#3408)
    • moved eval loop (#3412, #3408)
    • trainer/separate argparse (#3421, #3428, #3432)
    • move lr_finder (#3434)
    • organize args (#3435, #3442, #3447, #3448, #3449, #3456)
    • move specific accelerator code (#3457)
    • group connectors (#3472)
    • accelerator connector methods x/n (#3469, #3470, #3474)
    • merge backends (#3476, #3477, #3478, #3480, #3482)
    • apex plugin (#3502)
    • precision plugins (#3504)
    • Result - make monitor default to checkpoint_on to simplify (#3571)
    • reference to the Trainer on the LightningDataModule (#3684)
    • add .log to lightning module (#3686, #3699, #3701, #3704, #3715)
    • enable tracking original metric when step and epoch are both true (#3685)
    • deprecated results obj, added support for simpler comms (#3681)
    • move backends back to individual files (#3712)
    • fixes logging for eval steps (#3763)
    • decoupled DDP, DDP spawn (#3733, #3766, #3767, #3774, #3802, #3806)
    • remove weight loading hack for ddp_cpu (#3808)
    • separate torchelastic from DDP (#3810)
    • separate SLURM from DDP (#3809)
    • decoupled DDP2 (#3816)
    • bug fix with logging val epoch end + monitor (#3812)
    • decoupled DDP, DDP spawn (#3733, #3817, #3819, #3927)
    • callback system and init DDP (#3836)
    • adding compute environments (#3837, #3842)
    • epoch can now log independently (#3843)
    • test selecting the correct backend. temp backends while slurm and TorchElastic are decoupled (#3848)
    • fixed init_slurm_connection causing hostname errors (#3856)
    • moves init apex from LM to apex connector (#3923)
    • moves sync bn to each backend (#3925)
    • moves configure ddp to each backend (#3924)
  • Deprecation warning (#3844)
  • Changed LearningRateLogger to LearningRateMonitor (#3251)
  • Used fsspec instead of gfile for all IO (#3320)
    • Swapped torch.load for fsspec load in DDP spawn backend (#3787)
    • Swapped torch.load for fsspec load in cloud_io loading (#3692)
    • Added support for to_disk() to use remote filepaths with fsspec (#3930)
    • Updated model_checkpoint's to_yaml to use fsspec open (#3801)
    • Fixed fsspec being inconsistent when doing fs.ls (#3805)
  • Refactor GPUStatsMonitor to improve training speed (#3257)
  • Changed IoU score behavior for classes absent in target and pred (#3098)
  • Changed IoU remove_bg bool to ignore_index optional int (#3098)
  • Changed defaults of save_top_k and save_last to None in ModelCheckpoint (#3680)
  • row_log_interval and log_save_interval are now based on training loop's global_step instead of epoch-internal batch index (#3667)
  • Silenced some warnings. verified ddp refactors (#3483)
  • Cleaning up stale logger tests (#3490)
  • Allow ModelCheckpoint monitor to be None (#3633)
  • Enable None model checkpoint default (#3669)
  • Skipped best_model_path if checkpoint_callback is None (#2962)
  • Used raise .. from .. to explicitly chain exceptions (#3750)
  • Mocking loggers (#3596, #3617, #3851, #3859, #3884, #3853, #3910, #3889, #3926)
  • Write predictions in LightningModule instead of EvalResult (#3882)
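
The `raise .. from ..` change listed above (#3750) relies on standard Python exception chaining. A minimal sketch of the pattern follows; `load_config` is a hypothetical helper invented for illustration and is not part of Lightning.

```python
def load_config(text):
    """Parse 'key=value' text; hypothetical helper used only for illustration."""
    try:
        key, value = text.split('=')
    except ValueError as err:
        # "raise ... from ..." records the original error as __cause__.
        raise RuntimeError(f'malformed config entry: {text!r}') from err
    return {key: value}


try:
    load_config('no-equals-sign')
except RuntimeError as err:
    chained = err  # keep a reference so the cause can be inspected later
```

With explicit chaining, the traceback shows the low-level error as the direct cause of the higher-level one, which tends to make failures easier to trace.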

Deprecated

  • Deprecated TrainResult and EvalResult, use self.log and self.write from the LightningModule to log metrics and write predictions. training_step can now only return a scalar (for the loss) or a dictionary with anything you want. (#3681)
  • Deprecate early_stop_callback Trainer argument (#3845)
  • Rename Trainer arguments row_log_interval >> log_every_n_steps and log_save_interval >> flush_logs_every_n_steps (#3748)
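
For configs that still use the old argument names, the renames above can be applied mechanically. This is a minimal sketch: `migrate_trainer_kwargs` and `RENAMED_TRAINER_ARGS` are hypothetical names, not part of Lightning, and only the two old-to-new pairs are taken from the deprecation note.

```python
# Old -> new Trainer argument names, as listed in the deprecation note above.
RENAMED_TRAINER_ARGS = {
    'row_log_interval': 'log_every_n_steps',
    'log_save_interval': 'flush_logs_every_n_steps',
}


def migrate_trainer_kwargs(kwargs):
    """Return a copy of kwargs with deprecated names replaced (hypothetical helper)."""
    migrated = {}
    for name, value in kwargs.items():
        new_name = RENAMED_TRAINER_ARGS.get(name, name)
        if new_name in migrated:
            # Refuse to guess when a setting is given under both names.
            raise ValueError(f'{new_name!r} given under both old and new names')
        migrated[new_name] = value
    return migrated


old_config = {'max_epochs': 5, 'row_log_interval': 50, 'log_save_interval': 100}
new_config = migrate_trainer_kwargs(old_config)
```

The resulting dict could then be passed on as keyword arguments when constructing the trainer.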

Removed

  • Removed experimental Metric API (#3868, #3943, #3949, #3946), listed changes before final removal:
    • Added EmbeddingSimilarity metric (#3349, #3358)
    • Added hooks to metric module interface (#2528)
    • Added error when AUROC metric is used for multiclass problems (#3350)
    • Fixed ModelCheckpoint with save_top_k=-1 option not tracking the best models when a monitor metric is available (#3735)
    • Fixed counter-intuitive error being thrown in Accuracy metric for zero target tensor (#3764)
    • Fixed aggregation of metrics (#3517)
    • Fixed Metric aggregation (#3321)
    • Fixed RMSLE metric (#3188)
    • Renamed reduction to class_reduction in classification metrics (#3322)
    • Changed class_reduction similar to sklearn for classification metrics (#3322)
    • Renaming of precision recall metric (#3308)

Fixed

  • Fixed on_train_batch_start hook to end epoch early (#3700)
  • Fixed num_sanity_val_steps is clipped to limit_val_batches (#2917)
  • Fixed ONNX model save on GPU (#3145)
  • Fixed GpuUsageLogger to work on different platforms (#3008)
  • Fixed auto-scale batch size not dumping auto_lr_find parameter (#3151)
  • Fixed batch_outputs with optimizer frequencies (#3229)
  • Fixed setting batch size in LightningModule.datamodule when using auto_scale_batch_size (#3266)
  • Fixed Horovod distributed backend compatibility with native AMP (#3404)
  • Fixed batch size auto scaling exceeding the size of the dataset (#3271)
  • Fixed getting experiment_id from MLFlow only once instead of each training loop (#3394)
  • Fixed overfit_batches which now correctly disables shuffling for the training loader. (#3501)
  • Fixed gradient norm tracking for row_log_interval > 1 (#3489)
  • Fixed ModelCheckpoint name formatting (#3164)
  • Fixed auto-scale batch size (#3151)
  • Fixed example implementation of AutoEncoder (#3190)
  • Fixed invalid paths when remote logging with TensorBoard (#3236)
  • Fixed change t() to transpose() as XLA devices do not support .t() on 1-dim tensor (#3252)
  • Fixed (weights only) checkpoints loading without PL (#3287)
  • Fixed gather_all_tensors cross GPUs in DDP (#3319)
  • Fixed CometML save dir (#3419)
  • Fixed forward key metrics (#3467)
  • Fixed normalize mode at confusion matrix (replace NaNs with zeros) (#3465)
  • Fixed global step increment in training loop when training_epoch_end hook is used (#3673)
  • Fixed dataloader shuffling not getting turned off with overfit_batches > 0 and distributed_backend = "ddp" (#3534)
  • Fixed determinism in DDPSpawnBackend when using seed_everything in main process (#3335)
  • Fixed ModelCheckpoint period to actually save every period epochs (#363...

synced BatchNorm, DataModules and final API

20 Aug 19:25
b40de54

Overview

The newest PyTorch Lightning release includes final API clean-up with better data decoupling and shorter logging syntax.

We're happy to release PyTorch Lightning 0.9 today, which contains many great new features and more bug fixes than any release we have ever had. Most importantly, it introduces our mostly final API changes! Lightning is being adopted by top researchers and AI labs around the world, and we are working hard to make sure we provide a smooth experience and support for all the latest best practices.

Detail changes

Added

  • Added SyncBN for DDP (#2801, #2838)
  • Added basic CSVLogger (#2721)
  • Added SSIM metrics (#2671)
  • Added BLEU metrics (#2535)
  • Added support to export a model to ONNX format (#2596)
  • Added support for Trainer(num_sanity_val_steps=-1) to check all validation data before training (#2246)
  • Added struct. output:
    • tests for val loop flow (#2605)
    • EvalResult support for train and val. loop (#2615, #2651)
    • weighted average in results obj (#2930)
    • fix result obj DP auto reduce (#3013)
  • Added class LightningDataModule (#2668)
  • Added support for PyTorch 1.6 (#2745)
  • Added call DataModule hooks implicitly in trainer (#2755)
  • Added support for Mean in DDP Sync (#2568)
  • Added remaining sklearn metrics: AveragePrecision, BalancedAccuracy, CohenKappaScore, DCG, Hamming, Hinge, Jaccard, MeanAbsoluteError, MeanSquaredError, MeanSquaredLogError, MedianAbsoluteError, R2Score, MeanPoissonDeviance, MeanGammaDeviance, MeanTweedieDeviance, ExplainedVariance (#2562)
  • Added support for limit_{mode}_batches (int) to work with infinite dataloader (IterableDataset) (#2840)
  • Added support returning python scalars in DP (#1935)
  • Added support to Tensorboard logger for OmegaConf hparams (#2846)
  • Added tracking of basic states in Trainer (#2541)
  • Tracks all outputs including TBPTT and multiple optimizers (#2890)
  • Added GPU Usage Logger (#2932)
  • Added strict=False for load_from_checkpoint (#2819)
  • Added saving test predictions on multiple GPUs (#2926)
  • Auto log the computational graph for loggers that support this (#3003)
  • Added warning when changing monitor and using results obj (#3014)
  • Added a hook transfer_batch_to_device to the LightningDataModule (#3038)

Changed

  • Truncated long version numbers in progress bar (#2594)
  • Enabling val/test loop disabling (#2692)
  • Refactored into accelerator module:
    • GPU training (#2704)
    • TPU training (#2708)
    • DDP(2) backend (#2796)
    • Retrieve last logged val from result by key (#3049)
  • Using .comet.config file for CometLogger (#1913)
  • Updated hooks arguments - breaking for setup and teardown (#2850)
  • Using gfile to support remote directories (#2164)
  • Moved optimizer creation after device placement for DDP backends (#2904)
  • Support **DictConfig for hparam serialization (#2519)
  • Removed callback metrics from test results obj (#2994)
  • Re-enabled naming metrics in ckpt name (#3060)
  • Changed progress bar epoch counting to start from 0 (#3061)

Deprecated

  • Deprecated Trainer attribute ckpt_path, which will now be set by weights_save_path (#2681)

Removed

  • Removed deprecated: (#2760)
    • core decorator data_loader
    • Module hook on_sanity_check_start and loading load_from_metrics
    • package pytorch_lightning.logging
    • Trainer arguments: show_progress_bar, num_tpu_cores, use_amp, print_nan_grads
    • LR Finder argument num_accumulation_steps

Fixed

  • Fixed accumulate_grad_batches for last batch (#2853)
  • Fixed setup call while testing (#2624)
  • Fixed local rank zero casting (#2640)
  • Fixed single scalar return from training (#2587)
  • Fixed Horovod backend to scale LR schedulers with the optimizer (#2626)
  • Fixed dtype and device properties not getting updated in submodules (#2657)
  • Fixed fast_dev_run to run for all dataloaders (#2581)
  • Fixed save_dir in loggers getting ignored by default value of weights_save_path when user did not specify weights_save_path (#2681)
  • Fixed weights_save_path getting ignored when logger=False is passed to Trainer (#2681)
  • Fixed TPU multi-core and Float16 (#2632)
  • Fixed test metrics not being logged with LoggerCollection (#2723)
  • Fixed data transfer to device when using torchtext.data.Field and include_lengths is True (#2689)
  • Fixed shuffle argument for the distributed sampler (#2789)
  • Fixed logging interval (#2694)
  • Fixed loss value in the progress bar is wrong when accumulate_grad_batches > 1 (#2738)
  • Fixed correct CWD for DDP sub-processes when using Hydra (#2719)
  • Fixed selecting GPUs using CUDA_VISIBLE_DEVICES (#2739, #2796)
  • Fixed false num_classes warning in metrics (#2781)
  • Fixed shell injection vulnerability in subprocess call (#2786)
  • Fixed LR finder and hparams compatibility (#2821)
  • Fixed ModelCheckpoint not saving the latest information when save_last=True (#2881)
  • Fixed ImageNet example: learning rate scheduler, number of workers and batch size when using DDP (#2889)
  • Fixed apex gradient clipping (#2829)
  • Fixed save apex scaler states (#2828)
  • Fixed a model loading issue with inheritance and variable positional arguments (#2911)
  • Fixed passing non_blocking=True when transferring a batch object that does not support it (#2910)
  • Fixed checkpointing to remote file paths (#2925)
  • Fixed adding val_step argument to metrics (#2986)
  • Fixed an issue that caused Trainer.test() to stall in DDP mode (#2997)
  • Fixed gathering of results with tensors of varying shape (#3020)
  • Fixed batch size auto-scaling feature to set the new value on the correct model attribute (#3043)
  • Fixed automatic batch scaling not working with half-precision (#3045)
  • Fixed setting device to root GPU (#3042)

Contributors

@ananthsub, @ananyahjha93, @awaelchli, @bkhakshoor, @Borda, @ethanwharris, @f4hy, @groadabike, @ibeltagy, @justusschock, @lezwon, @nateraw, @neighthan, @nsarang, @PhilJd, @pwwang, @rohitgr7, @romesco, @ruotianluo, @shijianjian, @SkafteNicki, @tgaddair, @thschaaf, @williamFalcon, @xmotli02, @ydcjeff, @yukw777, @zerogerc

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Bug fixes and .test() fix + TPU tests

10 Jul 02:01
92d6abc

Overview

The point of this release is more bug fixes ahead of v 1.0.0. We now have CI tests on TPU thanks to @zcain117 from Google! 🙂
This means we fixed many TPU bugs we hadn’t caught before because we had no tests.
In addition, we fixed:

  • all the file path errors with loggers (txs @awaelchli)
  • pickling errors with loggers (txs @awaelchli)
  • fixed all the .test() calls

Detail changes

Added

  • Added a PSNR metric: peak signal-to-noise ratio (#2483)
  • Added functional regression metrics (#2492)

Removed

  • Removed auto val reduce (#2462)

Fixed

  • Flattening Wandb Hyperparameters (#2459)
  • Fixed using the same DDP python interpreter and actually running (#2482)
  • Fixed model summary input type conversion for models that have input dtype different from model parameters (#2510)
  • Made TensorBoardLogger and CometLogger pickleable (#2518)
  • Fixed a problem with MLflowLogger creating multiple run folders (#2502)
  • Fixed global_step increment (#2455)
  • Fixed TPU hanging example (#2488)
  • Fixed argparse default value bug (#2526)
  • Fixed Dice and IoU to avoid NaN by adding small eps (#2545)
  • Fixed accumulate gradients schedule at epoch 0 (continued) (#2513)
  • Fixed Trainer .fit() returning last not best weights in "ddp_spawn" (#2565)
  • Fixed passing (do not pass) TPU weights back on test (#2566)
  • Fixed DDP tests and .test() (#2512, #2570)

Contributors

@anthonytec2, @awaelchli, @bernardomig, @Borda, @EspenHa, @HHousen, @InCogNiTo124, @rohitgr7, @williamFalcon

If we forgot someone due to not matching commit email with GitHub account, let us know :]

More bug fixing!

01 Jul 11:56
695e051

Detail changes

Added

  • Added reduce ddp results on eval (#2434)
  • Added a warning when an IterableDataset has __len__ defined (#2437)

Changed

  • Enabled no returns from eval (#2446)

Fixed

  • Fixes train outputs (#2428)
  • Fixes Conda dependencies (#2412)
  • Fixed Apex scaling with decoupled backward (#2433)
  • Fixed crashing or wrong displaying progressbar because of missing ipywidgets (#2417)
  • Fixed TPU saving dir (fc26078, 04e68f0)
  • Fixed logging on rank 0 only (#2425)

Contributors

@awaelchli, @Borda, @olineumann, @williamFalcon

Bug fixing

29 Jun 11:38
dec074c

Fixed

DDP and Checkpoint bug fixes

29 Jun 02:09
8f07b77
Pre-release

Overview

As we continue to strengthen the codebase with more tests, we’re finally getting rid of annoying bugs that have been around for a bit now. Mostly around the inconsistent checkpoint and early stopping behaviour (amazing work @awaelchli @jeremyjordan )

Noteworthy changes:

  • Fixed TPU flag parsing
  • fixed average_precision metric
  • all the checkpoint issues should be gone now (including backward support for old checkpoints)
  • DDP + loggers should be fixed

Detail changes

Added

  • Added TorchText support for moving data to GPU (#2379)

Changed

  • Changed epoch indexing to start from 0 instead of 1 (#2289)
  • Refactor Model backward (#2276)
  • Refactored training_batch + tests to verify correctness (#2327, #2328)
  • Refactored training loop (#2336)
  • Made optimization steps for hooks (#2363)
  • Changed default apex level to 'O2' (#2362)

Removed

  • Moved TrainsLogger to Bolts (#2384)

Fixed

  • Fixed parsing TPU arguments and TPU tests (#2094)
  • Fixed number batches in case of multiple dataloaders and limit_{*}_batches (#1920, #2226)
  • Fixed an issue with forward hooks not being removed after model summary (#2298)
  • Fix for load_from_checkpoint() not working with absolute path on Windows (#2294)
  • Fixed an issue with how _has_len handles NotImplementedError, e.g. raised by torchtext.data.Iterator (#2293, #2307)
  • Fixed average_precision metric (#2319)
  • Fixed ROC metric for CUDA tensors (#2304)
  • Fixed lost compatibility with custom datatypes implementing .to (#2335)
  • Fixed loading model with kwargs (#2387)
  • Fixed sum(0) for trainer.num_val_batches (#2268)
  • Fixed checking if the parameters are a DictConfig Object (#2216)
  • Fixed SLURM weights saving (#2341)
  • Fixed swaps LR scheduler order (#2356)
  • Fixed adding tensorboard hparams logging test (#2342)
  • Fixed use model ref for tear down (#2360)
  • Fixed logger crash on DDP (#2388)
  • Fixed several issues with early stopping and checkpoint callbacks (#1504, #2391)
  • Fixed loading past checkpoints from v0.7.x (#2405)
  • Fixed loading model without arguments (#2403)

Contributors

@airium, @awaelchli, @Borda, @elias-ramzi, @jeremyjordan, @lezwon, @mateuszpieniak, @mmiakashs, @pwl, @rohitgr7, @ssakhavi, @thschaaf, @tridao, @williamFalcon

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Fixing hooks & hparams

19 Jun 06:44
2fbc997

Overview

Fixing critical bugs in newly added hooks and hparams assignment.
The recommended data flow is as follows:

  1. use prepare_data to download and process the dataset.
  2. use setup to do splits, and build your model internals
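
The two-step flow above can be sketched in plain Python. `ToyData` is a hypothetical stand-in, not a Lightning class: in real use the same two method names would live on your model and be called by the trainer, and the real hooks may take extra arguments that this sketch omits.

```python
import json
import os
import tempfile


class ToyData:
    """Hypothetical stand-in that follows the prepare_data -> setup split."""

    def __init__(self, data_dir):
        self.path = os.path.join(data_dir, 'dataset.json')
        self.train, self.val = None, None

    def prepare_data(self):
        # Step 1: "download" and process once, persisting the result to disk.
        if not os.path.exists(self.path):
            with open(self.path, 'w') as f:
                json.dump(list(range(10)), f)

    def setup(self):
        # Step 2: load the prepared data and build the splits.
        with open(self.path) as f:
            samples = json.load(f)
        self.train, self.val = samples[:8], samples[8:]


with tempfile.TemporaryDirectory() as data_dir:
    data = ToyData(data_dir)
    data.prepare_data()
    data.setup()
```

Keeping the download step free of in-memory state means it can safely run once, while the split-building step can run wherever the data is actually consumed.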

Detail changes

  • Fixed the load_from_checkpoint path detected as URL bug (#2244)
  • Fixed hooks - added barrier (#2245, #2257, #2260)
  • Fixed hparams - remove frame inspection on self.hparams (#2253)
  • Fixed setup and on fit calls (#2252)
  • Fixed GPU template (#2255)

Metrics, speed improvements, new hooks and flags

19 Jun 07:02
e0b7359

Overview

Highlights of this release are adding Metric package and new hooks and flags to customize your workflow.

Major features:

  • brand new Metrics package with built-in DDP support (by @justusschock and @SkafteNicki)
  • hparams can now be anything! (call self.save_hyperparameters() to register anything in the __init__)
  • many speed improvements (how we move data, adjusted some flags & PL now adds 300ms overhead per epoch only!)
  • much faster ddp implementation. Old one was renamed ddp_spawn
  • better support for Hydra
  • added the overfit_batches flag and corrected some bugs with the limit_[train,val,test]_batches flag
  • added conda support
  • tons of bug fixes 😉

Detail changes

Added

  • Added overfit_batches, limit_{val|test}_batches flags (overfit now uses training set for all three) (#2213)
  • Added metrics
  • Added type hints in Trainer.fit() and Trainer.test() to reflect that also a list of dataloaders can be passed in (#1723)
  • Allow dataloaders without sampler field present (#1907)
  • Added option save_last to save the model at the end of every epoch in ModelCheckpoint (#1908)
  • Early stopping checks on_validation_end (#1458)
  • Attribute best_model_path to ModelCheckpoint for storing and later retrieving the path to the best saved model file (#1799)
  • Speed up single-core TPU training by loading data using ParallelLoader (#2033)
  • Added a model hook transfer_batch_to_device that enables moving custom data structures to the target device (#1756)
  • Added black formatter for the code with code-checker on pull (#1610)
  • Added back the slow spawn ddp implementation as ddp_spawn (#2115)
  • Added loading checkpoints from URLs (#1667)
  • Added a callback method on_keyboard_interrupt for handling KeyboardInterrupt events during training (#2134)
  • Added a decorator auto_move_data that moves data to the correct device when using the LightningModule for inference (#1905)
  • Added ckpt_path option to LightningModule.test(...) to load particular checkpoint (#2190)
  • Added setup and teardown hooks for model (#2229)

Changed

  • Allow user to select individual TPU core to train on (#1729)
  • Removed non-finite values from loss in LRFinder (#1862)
  • Allow passing model hyperparameters as complete kwarg list (#1896)
  • Renamed ModelCheckpoint's attributes best to best_model_score and kth_best_model to kth_best_model_path (#1799)
  • Re-Enable Logger's ImportErrors (#1938)
  • Changed the default value of the Trainer argument weights_summary from full to top (#2029)
  • Raise an error when lightning replaces an existing sampler (#2020)
  • Enabled prepare_data from correct processes - clarify local vs global rank (#2166)
  • Remove explicit flush from tensorboard logger (#2126)
  • Changed epoch indexing to start from 1 instead of 0 (#2206)

Deprecated

  • Deprecated flags: (#2213)
    • overfit_pct in favour of overfit_batches
    • val_percent_check in favour of limit_val_batches
    • test_percent_check in favour of limit_test_batches
  • Deprecated ModelCheckpoint's attributes best and kth_best_model (#1799)
  • Dropped official support/testing for older PyTorch versions <1.3 (#1917)

Removed

  • Removed unintended Trainer argument progress_bar_callback, the callback should be passed in by Trainer(callbacks=[...]) instead (#1855)
  • Removed obsolete self._device in Trainer (#1849)
  • Removed deprecated API (#2073)
    • Packages: pytorch_lightning.pt_overrides, pytorch_lightning.root_module
    • Modules: pytorch_lightning.logging.comet_logger, pytorch_lightning.logging.mlflow_logger, pytorch_lightning.logging.test_tube_logger, pytorch_lightning.overrides.override_data_parallel, pytorch_lightning.core.model_saving, pytorch_lightning.core.root_module
    • Trainer arguments: add_row_log_interval, default_save_path, gradient_clip, nb_gpu_nodes, max_nb_epochs, min_nb_epochs, nb_sanity_val_steps
    • Trainer attributes: nb_gpu_nodes, num_gpu_nodes, gradient_clip, max_nb_epochs, min_nb_epochs, nb_sanity_val_steps, default_save_path, tng_tqdm_dic

Fixed

  • Run graceful training teardown on interpreter exit (#1631)
  • Fixed user warning when apex was used together with learning rate schedulers (#1873)
  • Fixed multiple calls of EarlyStopping callback (#1863)
  • Fixed an issue with Trainer.from_argparse_args when passing in unknown Trainer args (#1932)
  • Fixed bug related to logger not being reset correctly for model after tuner algorithms (#1933)
  • Fixed root node resolution for SLURM cluster with dash in hostname (#1954)
  • Fixed LearningRateLogger in multi-scheduler setting (#1944)
  • Fixed test configuration check and testing (#1804)
  • Fixed an issue with Trainer constructor silently ignoring unknown/misspelt arguments (#1820)
  • Fixed save_weights_only in ModelCheckpoint (#1780)
  • Allow use of same WandbLogger instance for multiple training loops (#2055)
  • Fixed an issue with _auto_collect_arguments collecting local variables that are not constructor arguments and not working for signatures that have the instance not named self (#2048)
  • Fixed mistake in parameters' grad norm tracking (#2012)
  • Fixed CPU and hanging GPU crash (#2118)
  • Fixed an issue with the model summary and example_input_array depending on a specific ordering of the submodules in a LightningModule (#1773)
  • Fixed Tpu logging (#2230)
  • Fixed Pid port + duplicate rank_zero logging (#2140, #2231)

Contributors

@awaelchli, @baldassarreFe, @Borda, @borisdayma, @cuent, @devashishshankar, @ivannz, @j-dsouza, @justusschock, @kepler, @kumuji, @lezwon, @lgvaz, @LoicGrobol, @mateuszpieniak, @maximsch2, @moi90, @rohitgr7, @SkafteNicki, @tullie, @williamFalcon, @yukw777, @ZhaofengWu

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Transfer learning, tuning batch size, torchelastic support

15 May 12:37
e95e1d7

Overview

Highlights of this release are support for TorchElastic, which enables distributed PyTorch training jobs to be executed in a fault-tolerant and elastic manner; auto-scaling of batch size; a new transfer learning example; and an option to provide a seed to random generators to ensure reproducibility.

Detail changes

Added

  • Added callback for logging learning rates (#1498)
  • Added transfer learning example (for a binary classification task in computer vision) (#1564)
  • Added type hints in Trainer.fit() and Trainer.test() to reflect that also a list of dataloaders can be passed in (#1723).
  • Added auto scaling of batch size (#1638)
  • The progress bar metrics now also get updated in training_epoch_end (#1724)
  • Enable NeptuneLogger to work with distributed_backend=ddp (#1753)
  • Added option to provide seed to random generators to ensure reproducibility (#1572)
  • Added override for hparams in load_from_ckpt (#1797)
  • Added support multi-node distributed execution under torchelastic (#1811, #1818)
  • Added using store_true for bool args (#1822, #1842)
  • Added dummy logger for internally disabling logging for some features (#1836)
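
The seeding option above (#1572) rests on a simple idea: the same seed yields the same random sequence. The sketch below shows that idea with the standard library only. `seed_all` is a hypothetical helper, not the library's API, and a real training setup would also need its NumPy and PyTorch generators seeded, which this sketch does not cover.

```python
import random


def seed_all(seed):
    """Hypothetical helper: seed the stdlib generator only."""
    random.seed(seed)
    return seed


seed_all(1234)
first_run = [random.random() for _ in range(3)]

seed_all(1234)  # re-seeding restarts the same pseudo-random sequence
second_run = [random.random() for _ in range(3)]
```

Both lists hold identical values, which is the property that makes runs repeatable.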

Changed

  • Enable non-blocking for device transfers to GPU (#1843)
  • Replace meta_tags.csv with hparams.yaml (#1271)
  • Reduction when batch_size < num_gpus (#1609)
  • Updated LightningTemplateModel to look more like Colab example (#1577)
  • Don't convert namedtuple to tuple when transferring the batch to target device (#1589)
  • Allow passing hparams as a keyword argument to LightningModule when loading from checkpoint (#1639)
  • Args should come after the last positional argument (#1807)
  • Made DDP the default if no backend specified with multiple GPUs (#1789)

Deprecated

  • Deprecated tags_csv in favor of hparams_file (#1271)

Fixed

  • Fixed broken link in PR template (#1675)
  • Fixed ModelCheckpoint not None checking file path (#1654)
  • Trainer now calls on_load_checkpoint() when resuming from a checkpoint (#1666)
  • Fixed sampler logic for DDP with the iterable dataset (#1734)
  • Fixed _reset_eval_dataloader() for IterableDataset (#1560)
  • Fixed Horovod distributed backend to set the root_gpu property (#1669)
  • Fixed wandb logger global_step affects other loggers (#1492)
  • Fixed disabling progress bar on non-zero ranks using Horovod backend (#1709)
  • Fixed bugs that prevented the LR finder from being used together with early stopping and validation dataloaders (#1676)
  • Fixed a bug in Trainer that prepended the checkpoint path with version_ when it shouldn't (#1748)
  • Fixed LR key name in case of param groups in LearningRateLogger (#1719)
  • Fixed saving native AMP scaler state (introduced in #1561)
  • Fixed accumulation parameter and suggestion method for learning rate finder (#1801)
  • Fixed number of processes not being set properly and the auto sampler failing with DDP (#1819)
  • Fixed bugs in semantic segmentation example (#1824)
  • Fixed saving native AMP scaler state (#1561, #1777)
  • Fixed native AMP + DDP (#1788)
  • Fixed hparam logging with metrics (#1647)

Contributors

@ashwinb, @awaelchli, @Borda, @cmpute, @festeh, @jbschiratti, @justusschock, @kepler, @kumuji, @nanddalal, @nathanbreitsch, @olineumann, @pitercl, @rohitgr7, @S-aiueo32, @SkafteNicki, @tgaddair, @tullie, @tw991, @williamFalcon, @ybrovman, @yukw777

If we forgot someone due to not matching commit email with GitHub account, let us know :]