Buffer release before 1.0
This release is a buffer in case 1.0 breaks compatibility for anyone who upgrades. 0.10.0 has all the bug fixes and features of 1.0 but is 100% backward compatible. The 1.0 release will follow within the next 24 hours.
Overview
The major changes are:
- Results objects are deprecated (we hated them too haha)
- This means dataflow and logging have been decoupled
To log:
def any_step(...):
    self.log('something', i_computed)
Separately, return whatever you want from methods:
def training_step(...):
    return loss
or
def training_step(...):
    return {'loss': loss, 'whatever': [1, 'want']}
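To make the decoupling concrete, here is a minimal sketch in plain Python. `ToyModule` and `toy_loop` are hypothetical stand-ins written for this illustration, not Lightning classes. They only mimic the idea that `self.log` records a metric on the side while whatever `training_step` returns flows back to the loop.

```python
# Toy illustration only: ToyModule and toy_loop are hypothetical stand-ins,
# not part of PyTorch Lightning. They mimic the shape of the new API.

class ToyModule:
    def __init__(self):
        self.logged = {}  # metrics recorded through self.log

    def log(self, name, value):
        # Logging is a side channel: it stores the value and returns nothing.
        self.logged[name] = value

    def training_step(self, batch, batch_idx):
        loss = sum(batch) / len(batch)  # stand-in for a real loss
        self.log("train_loss", loss)
        # The return value is independent of what was logged.
        return {"loss": loss, "whatever": [1, "want"]}


def toy_loop(module, batches):
    """Call training_step on each batch and collect the returned losses."""
    outputs = []
    for batch_idx, batch in enumerate(batches):
        out = module.training_step(batch, batch_idx)
        # Accept either a bare loss or a dict with a 'loss' key.
        loss = out["loss"] if isinstance(out, dict) else out
        outputs.append(loss)
    return outputs


module = ToyModule()
losses = toy_loop(module, [[1.0, 3.0], [2.0, 4.0]])
```

In this sketch the loop never looks at `module.logged`, and the logger never sees the returned dictionary, which is the separation the overview describes.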
Detail changes
Added
- Added new Metrics API (#3868, #3921)
- Enable PyTorch 1.7 compatibility (#3541)
- Added `LightningModule.to_torchscript` to support exporting as `ScriptModule` (#3258)
- Added warning when dropping unpicklable `hparams` (#2874)
- Added EMB similarity (#3349)
- Added `ModelCheckpoint.to_yaml` method (#3048)
- Allow `ModelCheckpoint` monitor to be `None`, meaning it will always save (#3630)
- Disabled optimizers setup during testing (#3059)
- Added support for datamodules to save and load checkpoints when training (#3563)
- Added support for datamodule in learning rate finder (#3425)
- Added gradient clip test for native AMP (#3754)
- Added dist lib to enable syncing anything across devices (#3762)
- Added `broadcast` to `TPUBackend` (#3814)
- Added `XLADeviceUtils` class to check XLA device type (#3274)
Changed
- Refactored accelerator backends:
  - moved TPU `xxx_step` to backend (#3118)
  - refactored DDP backend `forward` (#3119)
  - refactored GPU backend `__step` (#3120)
  - refactored Horovod backend (#3121, #3122)
  - remove obscure forward call in eval + CPU backend `___step` (#3123)
  - reduced all simplified forward (#3126)
  - added hook base method (#3127)
  - refactor eval loop to use hooks - use `test_mode` for if so we can split later (#3129)
  - moved `___step_end` hooks (#3130)
  - training forward refactor (#3134)
  - training AMP scaling refactor (#3135)
  - eval step scaling factor (#3136)
  - add eval loop object to streamline eval loop (#3138)
  - refactored dataloader process hook (#3139)
  - refactored inner eval loop (#3141)
  - final inner eval loop hooks (#3154)
  - clean up hooks in `run_evaluation` (#3156)
  - clean up data reset (#3161)
  - expand eval loop out (#3165)
  - moved hooks around in eval loop (#3195)
  - remove `_evaluate` fx (#3197)
  - `Trainer.fit` hook clean up (#3198)
  - DDPs train hooks (#3203)
  - refactor DDP backend (#3204, #3207, #3208, #3209, #3210)
  - reduced accelerator selection (#3211)
  - group prepare data hook (#3212)
  - added data connector (#3285)
  - modular is_overridden (#3290)
  - adding `Trainer.tune()` (#3293)
  - move `run_pretrain_routine` -> `setup_training` (#3294)
  - move train outside of setup training (#3297)
  - move `prepare_data` to data connector (#3307)
  - moved accelerator router (#3309)
  - train loop refactor - moving train loop to own object (#3310, #3312, #3313, #3314)
  - duplicate data interface definition up into DataHooks class (#3344)
  - inner train loop (#3359, #3361, #3362, #3363, #3365, #3366, #3367, #3368, #3369, #3370, #3371, #3372, #3373, #3374, #3375, #3376, #3385, #3388, #3397)
  - all logging related calls in a connector (#3395)
  - device parser (#3400, #3405)
  - added model connector (#3407)
  - moved eval loop logging to loggers (#3408)
  - moved eval loop (#3412, #3408)
  - trainer/separate argparse (#3421, #3428, #3432)
  - move `lr_finder` (#3434)
  - organize args (#3435, #3442, #3447, #3448, #3449, #3456)
  - move specific accelerator code (#3457)
  - group connectors (#3472)
  - accelerator connector methods x/n (#3469, #3470, #3474)
  - merge backends (#3476, #3477, #3478, #3480, #3482)
  - apex plugin (#3502)
  - precision plugins (#3504)
  - Result - make monitor default to `checkpoint_on` to simplify (#3571)
  - reference to the Trainer on the `LightningDataModule` (#3684)
  - add `.log` to lightning module (#3686, #3699, #3701, #3704, #3715)
  - enable tracking original metric when step and epoch are both true (#3685)
  - deprecated results obj, added support for simpler comms (#3681)
  - move backends back to individual files (#3712)
  - fixes logging for eval steps (#3763)
  - decoupled DDP, DDP spawn (#3733, #3766, #3767, #3774, #3802, #3806)
  - remove weight loading hack for ddp_cpu (#3808)
  - separate `torchelastic` from DDP (#3810)
  - separate SLURM from DDP (#3809)
  - decoupled DDP2 (#3816)
  - bug fix with logging val epoch end + monitor (#3812)
  - decoupled DDP, DDP spawn (#3733, #3817, #3819, #3927)
  - callback system and init DDP (#3836)
  - adding compute environments (#3837, #3842)
  - epoch can now log independently (#3843)
  - test selecting the correct backend. temp backends while slurm and TorchElastic are decoupled (#3848)
  - fixed `init_slurm_connection` causing hostname errors (#3856)
  - moves init apex from LM to apex connector (#3923)
  - moves sync bn to each backend (#3925)
  - moves configure ddp to each backend (#3924)
- Deprecation warning (#3844)
- Changed `LearningRateLogger` to `LearningRateMonitor` (#3251)
- Used `fsspec` instead of `gfile` for all IO (#3320)
  - Swapped `torch.load` for `fsspec` load in DDP spawn backend (#3787)
  - Swapped `torch.load` for `fsspec` load in cloud_io loading (#3692)
  - Added support for `to_disk()` to use remote filepaths with `fsspec` (#3930)
  - Updated model_checkpoint's to_yaml to use `fsspec` open (#3801)
  - Fixed `fsspec` is inconsistent when doing `fs.ls` (#3805)
- Refactor `GPUStatsMonitor` to improve training speed (#3257)
- Changed IoU score behavior for classes absent in target and pred (#3098)
- Changed IoU `remove_bg` bool to `ignore_index` optional int (#3098)
- Changed defaults of `save_top_k` and `save_last` to `None` in ModelCheckpoint (#3680)
- `row_log_interval` and `log_save_interval` are now based on training loop's `global_step` instead of epoch-internal batch index (#3667)
- Silenced some warnings. verified ddp refactors (#3483)
- Cleaning up stale logger tests (#3490)
- Allow `ModelCheckpoint` monitor to be `None` (#3633)
- Enable `None` model checkpoint default (#3669)
- Skipped `best_model_path` if `checkpoint_callback` is `None` (#2962)
- Used `raise .. from ..` to explicitly chain exceptions (#3750)
- Mocking loggers (#3596, #3617, #3851, #3859, #3884, #3853, #3910, #3889, #3926)
- Write predictions in LightningModule instead of EvalResult (#3882)
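For readers unfamiliar with the `raise .. from ..` item above (#3750), the sketch below shows what explicit exception chaining looks like in plain Python. The function `load_config` and the file path are hypothetical and are not taken from the Lightning code base.

```python
def load_config(path):
    """Read a config file, chaining the low-level error to a clearer one."""
    try:
        with open(path) as f:
            return f.read()
    except OSError as err:
        # "from err" stores the original exception in __cause__,
        # so the traceback shows both the cause and the new error.
        raise RuntimeError(f"Could not load config from {path!r}") from err


# Hypothetical path that is expected not to exist.
try:
    load_config("/nonexistent/dir/config.yaml")
except RuntimeError as exc:
    chained = exc
```

Without the `from err` clause Python would still attach the original error as context, but the explicit form signals that the second exception was raised deliberately because of the first.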
Deprecated
- Deprecated `TrainResult` and `EvalResult`, use `self.log` and `self.write` from the `LightningModule` to log metrics and write predictions. `training_step` can now only return a scalar (for the loss) or a dictionary with anything you want. (#3681)
- Deprecate `early_stop_callback` Trainer argument (#3845)
- Rename Trainer arguments `row_log_interval` >> `log_every_n_steps` and `log_save_interval` >> `flush_logs_every_n_steps` (#3748)
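If existing scripts build their Trainer arguments as a dictionary, the two renames above can be applied mechanically. The helper below is a hypothetical sketch for illustration (`migrate_trainer_kwargs` and `RENAMED_TRAINER_ARGS` are not part of Lightning). Only the old and new argument names come from these notes.

```python
import warnings

# Old Trainer argument name -> new name, as listed in the Deprecated section.
RENAMED_TRAINER_ARGS = {
    "row_log_interval": "log_every_n_steps",
    "log_save_interval": "flush_logs_every_n_steps",
}


def migrate_trainer_kwargs(kwargs):
    """Return a copy of kwargs with deprecated names replaced (hypothetical helper)."""
    migrated = {}
    for name, value in kwargs.items():
        new_name = RENAMED_TRAINER_ARGS.get(name, name)
        if new_name != name:
            warnings.warn(
                f"`{name}` is deprecated, use `{new_name}` instead",
                DeprecationWarning,
            )
        migrated[new_name] = value
    return migrated


old_kwargs = {"row_log_interval": 50, "log_save_interval": 100, "max_epochs": 3}
with warnings.catch_warnings():
    # Silence the warnings here only to keep the example output clean.
    warnings.simplefilter("ignore", DeprecationWarning)
    new_kwargs = migrate_trainer_kwargs(old_kwargs)
```

The resulting dictionary could then be unpacked into the Trainer constructor, for example `Trainer(**new_kwargs)`, assuming the remaining keys are valid Trainer arguments.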
Removed
- Removed experimental Metric API (#3868, #3943, #3949, #3946), listed changes before final removal:
  - Added `EmbeddingSimilarity` metric (#3349, #3358)
  - Added hooks to metric module interface (#2528)
  - Added error when AUROC metric is used for multiclass problems (#3350)
  - Fixed `ModelCheckpoint` with `save_top_k=-1` option not tracking the best models when a monitor metric is available (#3735)
  - Fixed counter-intuitive error being thrown in `Accuracy` metric for zero target tensor (#3764)
  - Fixed aggregation of metrics (#3517)
  - Fixed Metric aggregation (#3321)
  - Fixed RMSLE metric (#3188)
  - Renamed `reduction` to `class_reduction` in classification metrics (#3322)
  - Changed `class_reduction` similar to sklearn for classification metrics (#3322)
  - Renaming of precision recall metric (#3308)
Fixed
- Fixed `on_train_batch_start` hook to end epoch early (#3700)
- Fixed `num_sanity_val_steps` is clipped to `limit_val_batches` (#2917)
- Fixed ONNX model save on GPU (#3145)
- Fixed `GpuUsageLogger` to work on different platforms (#3008)
- Fixed auto-scale batch size not dumping `auto_lr_find` parameter (#3151)
- Fixed `batch_outputs` with optimizer frequencies (#3229)
- Fixed setting batch size in `LightningModule.datamodule` when using `auto_scale_batch_size` (#3266)
- Fixed Horovod distributed backend compatibility with native AMP (#3404)
- Fixed batch size auto scaling exceeding the size of the dataset (#3271)
- Fixed getting `experiment_id` from MLFlow only once instead of each training loop (#3394)
- Fixed `overfit_batches` which now correctly disables shuffling for the training loader. (#3501)
- Fixed gradient norm tracking for `row_log_interval > 1` (#3489)
- Fixed `ModelCheckpoint` name formatting (#3164)
- Fixed auto-scale batch size (#3151)
- Fixed example implementation of AutoEncoder (#3190)
- Fixed invalid paths when remote logging with TensorBoard (#3236)
- Fixed change `t()` to `transpose()` as XLA devices do not support `.t()` on 1-dim tensor (#3252)
- Fixed (weights only) checkpoints loading without PL (#3287)
- Fixed `gather_all_tensors` cross GPUs in DDP (#3319)
- Fixed CometML save dir (#3419)
- Fixed forward key metrics (#3467)
- Fixed normalize mode at confusion matrix (replace NaNs with zeros) (#3465)
- Fixed global step increment in training loop when `training_epoch_end` hook is used (#3673)
- Fixed dataloader shuffling not getting turned off with `overfit_batches > 0` and `distributed_backend = "ddp"` (#3534)
- Fixed determinism in `DDPSpawnBackend` when using `seed_everything` in main process (#3335)
- Fixed `ModelCheckpoint` `period` to actually save every `period` epochs (#3630)
- Fixed `val_progress_bar` total with `num_sanity_val_steps` (#3751)
- Fixed Tuner dump: add `current_epoch` to dumped_params (#3261)
- Fixed `current_epoch` and `global_step` properties mismatch between `Trainer` and `LightningModule` (#3785)
- Fixed learning rate scheduler for optimizers with internal state (#3897)
- Fixed `tbptt_reduce_fx` when non-floating tensors are logged (#3796)
- Fixed model checkpoint frequency (#3852)
- Fixed logging non-tensor scalar with result breaks subsequent epoch aggregation (#3855)
- Fixed `TrainerEvaluationLoopMixin` activates `model.train()` at the end (#3858)
- Fixed `overfit_batches` when using with multiple val/test_dataloaders (#3857)
- Fixed enables `training_step` to return `None` (#3862)
- Fixed init nan for checkpointing (#3863)
- Fixed for `load_from_checkpoint` (#2776)
- Fixes incorrect `batch_sizes` when Dataloader returns a dict with multiple tensors (#3668)
- Fixed unexpected signature for `validation_step` (#3947)
Contributors
@abrahambotros, @akihironitta, @ananthsub, @ananyahjha93, @awaelchli, @Borda, @c00k1ez, @carmocca, @f4hy, @GimmickNG, @jbschiratti, @justusschock, @LeeJZh, @lezwon, @Lucas-Steinmann, @maxjeblick, @monney, @mpariente, @nateraw, @nrupatunga, @patrickorlando, @PhilJd, @rohitgr7, @s-rog, @ShomyLiu, @SkafteNicki, @Sordie, @teddykoker, @tgaddair, @Vozf, @williamFalcon, @XDynames, @ydcjeff
If we forgot someone due to not matching the commit email with GitHub account, let us know :]