Releases: Lightning-AI/pytorch-lightning
1.0.0 - General availability
Overview
...
Detail changes
- Added Explained Variance Metric + metric fix (#4013)
- Added Metric <-> Lightning Module integration tests (#4008)
- Added parsing OS env vars in
Trainer
(#4022) - Added classification metrics (#4043)
- Updated explained variance metric (#4024)
- Enabled plugins (#4041)
- Enabled custom clusters (#4048)
- Enabled passing in custom accelerators (#4050)
- Added
LightningModule.toggle_optimizer
(#4058) - Added
LightningModule.manual_backward
(#4063)
Changed
- Integrated metrics API with self.log (#3961)
- Decoupled Appex (#4052, #4054, #4055, #4056, #4058, #4060, #4061, #4062, #4063, #4064, #4065)
- Renamed all backends to
Accelerator
(#4066) - Enabled manual returns (#4089)
Removed
- Removed
output
argument from*_batch_end
hooks (#3965, #3966) - Removed
output
argument from*_epoch_end
hooks (#3967) - Removed support for EvalResult and TrainResult (#3968)
- Removed deprecated trainer flags:
overfit_pct
,log_save_interval
,row_log_interval
(#3969) - Removed deprecated early_stop_callback (#3982)
- Removed deprecated model hooks (#3980)
- Removed deprecated callbacks (#3979)
- Removed
trainer
argument inLightningModule.backward
[#4056)
Fixed
- Fixed
current_epoch
property update to reflect true epoch number insideLightningDataModule
, whenreload_dataloaders_every_epoch=True
. (#3974) - Fixed to print scaler value in progress bar (#4053)
- Fixed mismatch between docstring and code regarding when
on_load_checkpoint
hook is called (#3996)
Contributors
@ananyahjha93, @Borda, @edenlightning, @hbredin, @rohitgr7, @SkafteNicki, @teddykoker, @williamFalcon
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Buffer release before 1.0
This release is a buffer in case 1.0 breaks any compatibility for people who upgrade. 0.10.0 has all the bug fixes and features of 1.0 but is 100% backward compatible. The 1.0 release following in the next 24 hours.
Overview
The major changes are:
- Results objects are deprecated (we hated them too haha)
- This means dataflow and logging have been decoupled
To log:
def any_step(...):
self.log('something', i_computed)
Separately, return whatever you want from methods:
def training_step(...):
return loss
or
def training_step(...):
return {'loss': loss, 'whatever': [1, 'want']}
Detail changes
Added
- Added new Metrics API. (#3868, [#3921)
- Enable PyTorch 1.7 compatibility (#3541)
- Added
LightningModule.to_torchscript
to support exporting asScriptModule
(#3258) - Added warning when dropping unpicklable
hparams
(#2874) - Added EMB similarity (#3349)
- Added
ModelCheckpoint.to_yaml
method (#3048) - Allow
ModelCheckpoint
monitor to beNone
, meaning it will always save ([3630) - Disabled optimizers setup during testing (#3059)
- Added support for datamodules to save and load checkpoints when training (#3563
- Added support for datamodule in learning rate finder (#3425)
- Added gradient clip test for native AMP (#3754)
- Added dist lib to enable syncing anything across devices (#3762)
- Added
broadcast
toTPUBackend
(#3814) - Added
XLADeviceUtils
class to check XLA device type (#3274)
Changed
- Refactored accelerator backends:
- moved TPU
xxx_step
to backend (#3118) - refactored DDP backend
forward
(#3119) - refactored GPU backend
__step
(#3120) - refactored Horovod backend (#3121, #3122)
- remove obscure forward call in eval + CPU backend
___step
(#3123) - reduced all simplified forward (#3126)
- added hook base method (#3127)
- refactor eval loop to use hooks - use
test_mode
for if so we can split later (#3129) - moved
___step_end
hooks (#3130) - training forward refactor (#3134)
- training AMP scaling refactor (#3135)
- eval step scaling factor (#3136)
- add eval loop object to streamline eval loop (#3138)
- refactored dataloader process hook (#3139)
- refactored inner eval loop (#3141)
- final inner eval loop hooks (#3154)
- clean up hooks in
run_evaluation
(#3156) - clean up data reset (#3161)
- expand eval loop out (#3165)
- moved hooks around in eval loop (#3195)
- remove
_evaluate
fx (#3197) Trainer.fit
hook clean up (#3198)- DDPs train hooks (#3203)
- refactor DDP backend (#3204, #3207, #3208, #3209, #3210)
- reduced accelerator selection (#3211)
- group prepare data hook (#3212)
- added data connector (#3285)
- modular is_overridden (#3290)
- adding
Trainer.tune()
(#3293) - move
run_pretrain_routine
->setup_training
(#3294) - move train outside of setup training (#3297)
- move
prepare_data
to data connector (#3307) - moved accelerator router (#3309)
- train loop refactor - moving train loop to own object (#3310, #3312, #3313, #3314)
- duplicate data interface definition up into DataHooks class (#3344)
- inner train loop (#3359, #3361, #3362, #3363, #3365, #3366, #3367, #3368, #3369, #3370, #3371, #3372, #3373, #3374, #3375, #3376, #3385, #3388, #3397)
- all logging related calls in a connector (#3395)
- device parser (#3400, #3405)
- added model connector (#3407)
- moved eval loop logging to loggers (#3408)
- moved eval loop (#3412[#3408)
- trainer/separate argparse (#3421, #3428, #3432)
- move
lr_finder
(#3434) - organize args (##3435, #3442, #3447, #3448, #3449, #3456)
- move specific accelerator code (#3457)
- group connectors (#3472)
- accelerator connector methods x/n (#3469, #3470, #3474)
- merge backends (#3476, #3477, #3478, #3480, #3482)
- apex plugin (#3502)
- precision plugins (#3504)
- Result - make monitor default to
checkpoint_on
to simplify (#3571) - reference to the Trainer on the
LightningDataModule
(#3684) - add
.log
to lightning module (#3686, #3699, #3701, #3704, #3715) - enable tracking original metric when step and epoch are both true (#3685)
- deprecated results obj, added support for simpler comms (#3681)
- move backends back to individual files (#3712)
- fixes logging for eval steps (#3763)
- decoupled DDP, DDP spawn (#3733, #3766, #3767, #3774, #3802, #3806)
- remove weight loading hack for ddp_cpu (#3808)
- separate
torchelastic
from DDP (#3810) - separate SLURM from DDP (#3809)
- decoupled DDP2 (#3816)
- bug fix with logging val epoch end + monitor (#3812)
- decoupled DDP, DDP spawn (#3733, #3817, #3819, #3927)
- callback system and init DDP (#3836)
- adding compute environments (#3837, [#3842)
- epoch can now log independently (#3843)
- test selecting the correct backend. temp backends while slurm and TorchElastic are decoupled (#3848)
- fixed
init_slurm_connection
causing hostname errors (#3856) - moves init apex from LM to apex connector (#3923)
- moves sync bn to each backend (#3925)
- moves configure ddp to each backend (#3924)
- moved TPU
- Deprecation warning (#3844)
- Changed
LearningRateLogger
toLearningRateMonitor
(#3251) - Used
fsspec
instead ofgfile
for all IO (#3320)- Swaped
torch.load
forfsspec
load in DDP spawn backend (#3787) - Swaped
torch.load
forfsspec
load in cloud_io loading (#3692) - Added support for
to_disk()
to use remote filepaths withfsspec
(#3930) - Updated model_checkpoint's to_yaml to use
fsspec
open (#3801) - Fixed
fsspec
is inconsistant when doingfs.ls
(#3805)
- Swaped
- Refactor
GPUStatsMonitor
to improve training speed (#3257) - Changed IoU score behavior for classes absent in target and pred (#3098)
- Changed IoU
remove_bg
bool toignore_index
optional int (#3098) - Changed defaults of
save_top_k
andsave_last
toNone
in ModelCheckpoint (#3680) row_log_interval
andlog_save_interval
are now based on training loop'sglobal_step
instead of epoch-internal batch index (#3667)- Silenced some warnings. verified ddp refactors (#3483)
- Cleaning up stale logger tests (#3490)
- Allow
ModelCheckpoint
monitor to beNone
(#3633) - Enable
None
model checkpoint default (#3669) - Skipped
best_model_path
ifcheckpoint_callback
isNone
(#2962) - Used
raise .. from ..
to explicitly chain exceptions (#3750) - Mocking loggers (#3596, #3617, #3851, #3859, #3884, #3853, #3910, #3889, #3926)
- Write predictions in LightningModule instead of EvalResult [#3882
Deprecated
- Deprecated
TrainResult
andEvalResult
, useself.log
andself.write
from theLightningModule
to log metrics and write predictions.training_step
can now only return a scalar (for the loss) or a dictionary with anything you want. (#3681) - Deprecate
early_stop_callback
Trainer argument (#3845) - Rename Trainer arguments
row_log_interval
>>log_every_n_steps
andlog_save_interval
>>flush_logs_every_n_steps
(#3748)
Removed
- Removed experimental Metric API (#3868, #3943, #3949, #3946), listed changes before final removal:
- Added
EmbeddingSimilarity
metric (#3349, [#3358) - Added hooks to metric module interface (#2528)
- Added error when AUROC metric is used for multiclass problems (#3350)
- Fixed
ModelCheckpoint
withsave_top_k=-1
option not tracking the best models when a monitor metric is available (#3735) - Fixed counter-intuitive error being thrown in
Accuracy
metric for zero target tensor (#3764) - Fixed aggregation of metrics (#3517)
- Fixed Metric aggregation (#3321)
- Fixed RMSLE metric (#3188)
- Renamed
reduction
toclass_reduction
in classification metrics (#3322) - Changed
class_reduction
similar to sklearn for classification metrics (#3322) - Renaming of precision recall metric (#3308)
- Added
Fixed
- Fixed
on_train_batch_start
hook to end epoch early (#3700) - Fixed
num_sanity_val_steps
is clipped tolimit_val_batches
(#2917) - Fixed ONNX model save on GPU (#3145)
- Fixed
GpuUsageLogger
to work on different platforms (#3008) - Fixed auto-scale batch size not dumping
auto_lr_find
parameter (#3151) - Fixed
batch_outputs
with optimizer frequencies (#3229) - Fixed setting batch size in
LightningModule.datamodule
when usingauto_scale_batch_size
(#3266) - Fixed Horovod distributed backend compatibility with native AMP (#3404)
- Fixed batch size auto scaling exceeding the size of the dataset (#3271)
- Fixed getting
experiment_id
from MLFlow only once instead of each training loop (#3394) - Fixed
overfit_batches
which now correctly disables shuffling for the training loader. (#3501) - Fixed gradient norm tracking for
row_log_interval > 1
(#3489) - Fixed
ModelCheckpoint
name formatting ([3164) - Fixed auto-scale batch size (#3151)
- Fixed example implementation of AutoEncoder (#3190)
- Fixed invalid paths when remote logging with TensorBoard (#3236)
- Fixed change
t()
totranspose()
as XLA devices do not support.t()
on 1-dim tensor (#3252) - Fixed (weights only) checkpoints loading without PL (#3287)
- Fixed
gather_all_tensors
cross GPUs in DDP (#3319) - Fixed CometML save dir (#3419)
- Fixed forward key metrics (#3467)
- Fixed normalize mode at confusion matrix (replace NaNs with zeros) (#3465)
- Fixed global step increment in training loop when
training_epoch_end
hook is used (#3673) - Fixed dataloader shuffling not getting turned off with
overfit_batches > 0
anddistributed_backend = "ddp"
(#3534) - Fixed determinism in
DDPSpawnBackend
when usingseed_everything
in main process (#3335) - Fixed
ModelCheckpoint
period
to actually save everyperiod
epochs (#363...
synced BatchNorm, DataModules and final API
Overview
The newest PyTorch Lightning release includes final API clean-up with better data decoupling and shorter logging syntax.
Were happy to release PyTorch Lightning 0.9 today, which contains many great new features, more bugfixes than any release we ever had, but most importantly it introduced our mostly final API changes! Lightning is being adopted by top researchers and AI labs around the world, and we are working hard to make sure we provide a smooth experience and support for all the latest best practices.
Detail changes
Added
- Added SyncBN for DDP (#2801, #2838)
- Added basic
CSVLogger
(#2721) - Added SSIM metrics (#2671)
- Added BLEU metrics (#2535)
- Added support to export a model to ONNX format (#2596)
- Added support for
Trainer(num_sanity_val_steps=-1)
to check all validation data before training (#2246) - Added struct. output:
- Added class
LightningDataModule
(#2668) - Added support for PyTorch 1.6 (#2745)
- Added call DataModule hooks implicitly in trainer (#2755)
- Added support for Mean in DDP Sync (#2568)
- Added remaining
sklearn
metrics:AveragePrecision
,BalancedAccuracy
,CohenKappaScore
,DCG
,Hamming
,Hinge
,Jaccard
,MeanAbsoluteError
,MeanSquaredError
,MeanSquaredLogError
,MedianAbsoluteError
,R2Score
,MeanPoissonDeviance
,MeanGammaDeviance
,MeanTweedieDeviance
,ExplainedVariance
(#2562) - Added support for
limit_{mode}_batches (int)
to work with infinite dataloader (IterableDataset) (#2840) - Added support returning python scalars in DP (#1935)
- Added support to Tensorboard logger for OmegaConf
hparams
(#2846) - Added tracking of basic states in
Trainer
(#2541) - Tracks all outputs including TBPTT and multiple optimizers (#2890)
- Added GPU Usage Logger (#2932)
- Added
strict=False
forload_from_checkpoint
(#2819) - Added saving test predictions on multiple GPUs (#2926)
- Auto log the computational graph for loggers that support this (#3003)
- Added warning when changing monitor and using results obj (#3014)
- Added a hook
transfer_batch_to_device
to theLightningDataModule
(#3038)
Changed
- Truncated long version numbers in progress bar (#2594)
- Enabling val/test loop disabling (#2692)
- Refactored into
accelerator
module: - Using
.comet.config
file forCometLogger
(#1913) - Updated hooks arguments - breaking for
setup
andteardown
(#2850) - Using
gfile
to support remote directories (#2164) - Moved optimizer creation after device placement for DDP backends (#2904](https://github.com/PyTorchLightning/pytorch-lighting/pull/2904))
- Support
**DictConfig
forhparam
serialization (#2519) - Removed callback metrics from test results obj (#2994)
- Re-enabled naming metrics in ckpt name (#3060)
- Changed progress bar epoch counting to start from 0 (#3061)
Deprecated
- Deprecated Trainer attribute
ckpt_path
, which will now be set byweights_save_path
(#2681)
Removed
- Removed deprecated: (#2760)
- core decorator
data_loader
- Module hook
on_sanity_check_start
and loadingload_from_metrics
- package
pytorch_lightning.logging
- Trainer arguments:
show_progress_bar
,num_tpu_cores
,use_amp
,print_nan_grads
- LR Finder argument
num_accumulation_steps
- core decorator
Fixed
- Fixed
accumulate_grad_batches
for last batch (#2853) - Fixed setup call while testing (#2624)
- Fixed local rank zero casting (#2640)
- Fixed single scalar return from training (#2587)
- Fixed Horovod backend to scale LR schedlers with the optimizer (#2626)
- Fixed
dtype
anddevice
properties not getting updated in submodules (#2657) - Fixed
fast_dev_run
to run for all dataloaders (#2581) - Fixed
save_dir
in loggers getting ignored by default value ofweights_save_path
when user did not specifyweights_save_path
(#2681) - Fixed
weights_save_path
getting ignored whenlogger=False
is passed to Trainer (#2681) - Fixed TPU multi-core and Float16 (#2632)
- Fixed test metrics not being logged with
LoggerCollection
(#2723) - Fixed data transfer to device when using
torchtext.data.Field
andinclude_lengths is True
(#2689) - Fixed shuffle argument for the distributed sampler (#2789)
- Fixed logging interval (#2694)
- Fixed loss value in the progress bar is wrong when
accumulate_grad_batches > 1
(#2738) - Fixed correct CWD for DDP sub-processes when using Hydra (#2719)
- Fixed selecting GPUs using
CUDA_VISIBLE_DEVICES
(#2739, #2796) - Fixed false
num_classes
warning in metrics (#2781) - Fixed shell injection vulnerability in subprocess call (#2786)
- Fixed LR finder and
hparams
compatibility (#2821) - Fixed
ModelCheckpoint
not saving the latest information whensave_last=True
(#2881) - Fixed ImageNet example: learning rate scheduler, number of workers and batch size when using DDP (#2889)
- Fixed apex gradient clipping (#2829)
- Fixed save apex scaler states (#2828)
- Fixed a model loading issue with inheritance and variable positional arguments (#2911)
- Fixed passing
non_blocking=True
when transferring a batch object that does not support it (#2910) - Fixed checkpointing to remote file paths (#2925)
- Fixed adding
val_step
argument to metrics (#2986) - Fixed an issue that caused
Trainer.test()
to stall in DDP mode (#2997) - Fixed gathering of results with tensors of varying shape (#3020)
- Fixed batch size auto-scaling feature to set the new value on the correct model attribute (#3043)
- Fixed automatic batch scaling not working with half-precision (#3045)
- Fixed setting device to root GPU (#3042)
Contributors
@ananthsub, @ananyahjha93, @awaelchli, @bkhakshoor, @Borda, @ethanwharris, @f4hy, @groadabike, @ibeltagy, @justusschock, @lezwon, @nateraw, @neighthan, @nsarang, @PhilJd, @pwwang, @rohitgr7, @romesco, @ruotianluo, @shijianjian, @SkafteNicki, @tgaddair, @thschaaf, @williamFalcon, @xmotli02, @ydcjeff, @yukw777, @zerogerc
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Bug fixes and .test() fix + TPU tests
Overview
The point of this release is more bug fixes ahead of v 1.0.0. We now have CI tests on TPU thanks to @zcain117 from Google! 🙂
This means we fixed many TPU bugs we hadn’t caught before because we had no tests.
In addition, we fixed:
- all the file path errors with loggers (txs @awaelchli)
- pickling errors with loggers (txs @awaelchli)
- fixed all the .test() calls
Detail changes
Added
Removed
- Removed auto val reduce (#2462)
Fixed
- Flattening Wandb Hyperparameters (#2459)
- Fixed using the same DDP python interpreter and actually running (#2482)
- Fixed model summary input type conversion for models that have input dtype different from model parameters (#2510)
- Made
TensorBoardLogger
andCometLogger
pickleable (#2518) - Fixed a problem with
MLflowLogger
creating multiple run folders (#2502) - Fixed global_step increment (#2455)
- Fixed TPU hanging example (#2488)
- Fixed
argparse
default value bug (#2526) - Fixed Dice and IoU to avoid NaN by adding small eps (#2545)
- Fixed accumulate gradients schedule at epoch 0 (continued) (#2513)
- Fixed Trainer
.fit()
returning last not best weights in "ddp_spawn" (#2565) - Fixed passing (do not pass) TPU weights back on test (#2566)
- Fixed DDP tests and
.test()
(#2512, #2570)
Contributors
@anthonytec2, @awaelchli, @bernardomig, @Borda, @EspenHa, @HHousen, @InCogNiTo124, @rohitgr7, @williamFalcon
If we forgot someone due to not matching commit email with GitHub account, let us know :]
More bug fixing!
Detail changes
Added
- Added reduce ddp results on eval (#2434)
- Added a warning when an
IterableDataset
has__len__
defined (#2437)
Changed
- Enabled no returns from eval (#2446)
Fixed
- Fixes train outputs (#2428)
- Fixes Conda dependencies (#2412)
- Fixed Apex scaling with decoupled backward (#2433)
- Fixed crashing or wrong displaying progressbar because of missing ipywidgets (#2417)
- Fixed TPU saving dir (fc26078, 04e68f0)
- Fixed logging on rank 0 only (#2425)
Contributors
Bug fixing
DDP and Checkpoint bug fixes
Overview
As we continue to strengthen the codebase with more tests, we’re finally getting rid of annoying bugs that have been around for a bit now. Mostly around the inconsistent checkpoint and early stopping behaviour (amazing work @awaelchli @jeremyjordan )
Noteworthy changes:
- Fixed TPU flag parsing
- fixed average_precision metric
- all the checkpoint issues should be gone now (including backward support for old checkpoints)
- DDP + loggers should be fixed
Detail changes
Added
- Added TorchText support for moving data to GPU (#2379)
Changed
- Changed epoch indexing from 0 instead of 1 (#2289)
- Refactor Model
backward
(#2276) - Refactored
training_batch
+ tests to verify correctness (#2327, #2328) - Refactored training loop (#2336)
- Made optimization steps for hooks (#2363)
- Changed default apex level to 'O2' (#2362)
Removed
- Moved
TrainsLogger
to Bolts (#2384)
Fixed
- Fixed parsing TPU arguments and TPU tests (#2094)
- Fixed number batches in case of multiple dataloaders and
limit_{*}_batches
(#1920, #2226) - Fixed an issue with forward hooks not being removed after model summary (#2298)
- Fix for
load_from_checkpoint()
not working with absolute path on Windows (#2294) - Fixed an issue how _has_len handles
NotImplementedError
e.g. raised bytorchtext.data.Iterator
(#2293), (#2307) - Fixed
average_precision
metric (#2319) - Fixed ROC metric for CUDA tensors (#2304)
- Fixed
average_precision
metric (#2319) - Fixed lost compatibility with custom datatypes implementing
.to
(#2335) - Fixed loading model with kwargs (#2387)
- Fixed sum(0) for
trainer.num_val_batches
(#2268) - Fixed checking if the parameters are a
DictConfig
Object (#2216) - Fixed SLURM weights saving (#2341)
- Fixed swaps LR scheduler order (#2356)
- Fixed adding tensorboard
hparams
logging test (#2342) - Fixed use model ref for tear down (#2360)
- Fixed logger crash on DDP (#2388)
- Fixed several issues with early stopping and checkpoint callbacks (#1504, #2391)
- Fixed loading past checkpoints from v0.7.x (#2405)
- Fixed loading model without arguments (#2403)
Contributors
@airium, @awaelchli, @Borda, @elias-ramzi, @jeremyjordan, @lezwon, @mateuszpieniak, @mmiakashs, @pwl, @rohitgr7, @ssakhavi, @thschaaf, @tridao, @williamFalcon
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Fixing hooks & hparams
Overview
Fixing critical bugs in newly added hooks and hparams
assignment.
The recommended data following:
- use
prepare_data
to download and process the dataset. - use
setup
to do splits, and build your model internals
Detail changes
Metrics, speed improvements, new hooks and flags
Overview
Highlights of this release are adding Metric package and new hooks and flags to customize your workflow.
Major features:
- brand new Metrics package with built-in DDP support (by @justusschock and @SkafteNicki)
hparams
can now be anything! (callself.save_hyperparameters()
to register anything in the_init_
- many speed improvements (how we move data, adjusted some flags & PL now adds 300ms overhead per epoch only!)
- much faster
ddp
implementation. Old one was renamedddp_spawn
- better support for Hydra
- added the overfit_batches flag and corrected some bugs with the
limit_[train,val,test]_batches
flag - added conda support
- tons of bug fixes 😉
Detail changes
Added
- Added
overfit_batches
,limit_{val|test}_batches
flags (overfit now uses training set for all three) (#2213) - Added metrics
- Added type hints in
Trainer.fit()
andTrainer.test()
to reflect that also a list of dataloaders can be passed in (#1723) - Allow dataloaders without sampler field present (#1907)
- Added option
save_last
to save the model at the end of every epoch inModelCheckpoint
(#1908) - Early stopping checks
on_validation_end
(#1458) - Attribute
best_model_path
toModelCheckpoint
for storing and later retrieving the path to the best saved model file (#1799) - Speed up single-core TPU training by loading data using
ParallelLoader
(#2033) - Added a model hook
transfer_batch_to_device
that enables moving custom data structures to the target device (#1756) - Added black formatter for the code with code-checker on pull (#1610)
- Added back the slow spawn ddp implementation as
ddp_spawn
(#2115) - Added loading checkpoints from URLs (#1667)
- Added a callback method
on_keyboard_interrupt
for handling KeyboardInterrupt events during training (#2134) - Added a decorator
auto_move_data
that moves data to the correct device when using the LightningModule for inference (#1905) - Added
ckpt_path
option toLightningModule.test(...)
to load particular checkpoint (#2190) - Added
setup
andteardown
hooks for model (#2229)
Changed
- Allow user to select individual TPU core to train on (#1729)
- Removed non-finite values from loss in
LRFinder
(#1862) - Allow passing model hyperparameters as complete kwarg list (#1896)
- Renamed
ModelCheckpoint
's attributesbest
tobest_model_score
andkth_best_model
tokth_best_model_path
(#1799) - Re-Enable Logger's
ImportError
s (#1938) - Changed the default value of the Trainer argument
weights_summary
fromfull
totop
(#2029) - Raise an error when lightning replaces an existing sampler (#2020)
- Enabled prepare_data from correct processes - clarify local vs global rank (#2166)
- Remove explicit flush from tensorboard logger (#2126)
- Changed epoch indexing from 1 instead of 0 (#2206)
Deprecated
- Deprecated flags: (#2213)
overfit_pct
in favour ofoverfit_batches
val_percent_check
in favour oflimit_val_batches
test_percent_check
in favour oflimit_test_batches
- Deprecated
ModelCheckpoint
's attributesbest
andkth_best_model
(#1799) - Dropped official support/testing for older PyTorch versions <1.3 (#1917)
Removed
- Removed unintended Trainer argument
progress_bar_callback
, the callback should be passed in byTrainer(callbacks=[...])
instead (#1855) - Removed obsolete
self._device
in Trainer (#1849) - Removed deprecated API (#2073)
- Packages:
pytorch_lightning.pt_overrides
,pytorch_lightning.root_module
- Modules:
pytorch_lightning.logging.comet_logger
,pytorch_lightning.logging.mlflow_logger
,pytorch_lightning.logging.test_tube_logger
,pytorch_lightning.overrides.override_data_parallel
,pytorch_lightning.core.model_saving
,pytorch_lightning.core.root_module
- Trainer arguments:
add_row_log_interval
,default_save_path
,gradient_clip
,nb_gpu_nodes
,max_nb_epochs
,min_nb_epochs
,nb_sanity_val_steps
- Trainer attributes:
nb_gpu_nodes
,num_gpu_nodes
,gradient_clip
,max_nb_epochs
,min_nb_epochs
,nb_sanity_val_steps
,default_save_path
,tng_tqdm_dic
- Packages:
Fixed
- Run graceful training teardown on interpreter exit (#1631)
- Fixed user warning when apex was used together with learning rate schedulers (#1873)
- Fixed multiple calls of
EarlyStopping
callback (#1863) - Fixed an issue with
Trainer.from_argparse_args
when passing in unknown Trainer args (#1932) - Fixed bug related to logger not being reset correctly for model after tuner algorithms (#1933)
- Fixed root node resolution for SLURM cluster with dash in hostname (#1954)
- Fixed
LearningRateLogger
in multi-scheduler setting (#1944) - Fixed test configuration check and testing (#1804)
- Fixed an issue with Trainer constructor silently ignoring unknown/misspelt arguments (#1820)
- Fixed
save_weights_only
in ModelCheckpoint (#1780) - Allow use of same
WandbLogger
instance for multiple training loops (#2055) - Fixed an issue with
_auto_collect_arguments
collecting local variables that are not constructor arguments and not working for signatures that have the instance not namedself
(#2048) - Fixed mistake in parameters' grad norm tracking (#2012)
- Fixed CPU and hanging GPU crash (#2118)
- Fixed an issue with the model summary and
example_input_array
depending on a specific ordering of the submodules in a LightningModule (#1773) - Fixed Tpu logging (#2230)
- Fixed Pid port + duplicate
rank_zero
logging (#2140, #2231)
Contributors
@awaelchli, @baldassarreFe, @Borda, @borisdayma, @cuent, @devashishshankar, @ivannz, @j-dsouza, @justusschock, @kepler, @kumuji, @lezwon, @lgvaz, @LoicGrobol, @mateuszpieniak, @maximsch2, @moi90, @rohitgr7, @SkafteNicki, @tullie, @williamFalcon, @yukw777, @ZhaofengWu
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Transfer learning, tuning batch size, torchelastic support
Overview
Highlights of this release are adding support for TorchElastic enables distributed PyTorch training jobs to be executed in a fault-tolerant and elastic manner; auto-scaling of batch size; new transfer learning example; an option to provide seed to random generators to ensure reproducibility.
Detail changes
Added
- Added callback for logging learning rates (#1498)
- Added transfer learning example (for a binary classification task in computer vision) (#1564)
- Added type hints in
Trainer.fit()
andTrainer.test()
to reflect that also a list of dataloaders can be passed in (#1723). - Added auto scaling of batch size (#1638)
- The progress bar metrics now also get updated in
training_epoch_end
(#1724) - Enable
NeptuneLogger
to work withdistributed_backend=ddp
(#1753) - Added option to provide seed to random generators to ensure reproducibility (#1572)
- Added override for hparams in
load_from_ckpt
(#1797) - Added support multi-node distributed execution under
torchelastic
(#1811, #1818) - Added using
store_true
for bool args (#1822, #1842) - Added dummy logger for internally disabling logging for some features (#1836)
Changed
- Enable
non-blocking
for device transfers to GPU (#1843) - Replace mata_tags.csv with hparams.yaml (#1271)
- Reduction when
batch_size < num_gpus
(#1609) - Updated LightningTemplateModel to look more like Colab example (#1577)
- Don't convert
namedtuple
totuple
when transferring the batch to target device (#1589) - Allow passing
hparams
as a keyword argument to LightningModule when loading from checkpoint (#1639) - Args should come after the last positional argument (#1807)
- Made DDP the default if no backend specified with multiple GPUs (#1789)
Deprecated
- Deprecated
tags_csv
in favor ofhparams_file
(#1271)
Fixed
- Fixed broken link in PR template (#1675)
- Fixed ModelCheckpoint not None checking file path (#1654)
- Trainer now calls
on_load_checkpoint()
when resuming from a checkpoint (#1666) - Fixed sampler logic for DDP with the iterable dataset (#1734)
- Fixed
_reset_eval_dataloader()
for IterableDataset (#1560) - Fixed Horovod distributed backend to set the
root_gpu
property (#1669) - Fixed wandb logger
global_step
affects other loggers (#1492) - Fixed disabling progress bar on non-zero ranks using Horovod backend (#1709)
- Fixed bugs that prevent LP finder to be used together with early stopping and validation dataloaders (#1676)
- Fixed a bug in Trainer that prepended the checkpoint path with
version_
when it shouldn't (#1748) - Fixed LR key name in case of param groups in LearningRateLogger (#1719)
- Fixed saving native AMP scaler state (introduced in #1561)
- Fixed accumulation parameter and suggestion method for learning rate finder (#1801)
- Fixed num processes wasn't being set properly and auto sampler was DDP failing (#1819)
- Fixed bugs in semantic segmentation example (#1824)
- Fixed saving native AMP scaler state (#1561, #1777)
- Fixed native AMP + DDP (#1788)
- Fixed
hparam
logging with metrics (#1647)
Contributors
@ashwinb, @awaelchli, @Borda, @cmpute, @festeh, @jbschiratti, @justusschock, @kepler, @kumuji, @nanddalal, @nathanbreitsch, @olineumann, @pitercl, @rohitgr7, @S-aiueo32, @SkafteNicki, @tgaddair, @tullie, @tw991, @williamFalcon, @ybrovman, @yukw777
If we forgot someone due to not matching commit email with GitHub account, let us know :]