Buffer release before 1.0

@williamFalcon williamFalcon released this 07 Oct 21:16
· 7109 commits to master since this release
b4051e7

This release is a buffer in case 1.0 breaks any compatibility for people who upgrade. 0.10.0 has all the bug fixes and features of 1.0 but is 100% backward compatible. The 1.0 release will follow in the next 24 hours.

Overview

The major changes are:

  • Results objects are deprecated (we hated them too haha)
  • This means dataflow and logging have been decoupled

To log:

def any_step(self, ...):
    self.log('something', i_computed)

Separately, return whatever you want from methods:

def training_step(self, ...):
    return loss

or

def training_step(self, ...):
    return {'loss': loss, 'whatever': [1, 'want']}
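
The decoupling described above can be sketched with a toy stand-in class. This is not Lightning's implementation: `ToyModule`, its `logged` dict, and the `extract_loss` helper are hypothetical names made up for illustration, and the "loss" is just a mean over a list. The point is only that logging a value and returning a value are two independent actions.

```python
# Toy stand-in for illustration only; NOT Lightning's implementation.
class ToyModule:
    def __init__(self):
        # Hypothetical store for everything passed to log().
        self.logged = {}

    def log(self, name, value):
        # Logging: record a metric. It does not affect the step's return value.
        self.logged[name] = value

    def training_step(self, batch):
        loss = sum(batch) / len(batch)  # placeholder "loss"
        self.log('train_loss', loss)
        # Dataflow: return whatever is useful, here a dict containing the loss.
        return {'loss': loss, 'whatever': [1, 'want']}


def extract_loss(step_output):
    # Hypothetical helper: accept either a bare loss or a dict with a 'loss' key.
    if isinstance(step_output, dict):
        return step_output['loss']
    return step_output


module = ToyModule()
output = module.training_step([1.0, 2.0, 3.0])
loss = extract_loss(output)
```

In this sketch the caller reads the loss from the return value, while `module.logged` holds only what was explicitly logged.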

Detailed changes

Added

  • Added new Metrics API. (#3868, #3921)
  • Enable PyTorch 1.7 compatibility (#3541)
  • Added LightningModule.to_torchscript to support exporting as ScriptModule (#3258)
  • Added warning when dropping unpicklable hparams (#2874)
  • Added EMB similarity (#3349)
  • Added ModelCheckpoint.to_yaml method (#3048)
  • Allow ModelCheckpoint monitor to be None, meaning it will always save (#3630)
  • Disabled optimizers setup during testing (#3059)
  • Added support for datamodules to save and load checkpoints when training (#3563)
  • Added support for datamodule in learning rate finder (#3425)
  • Added gradient clip test for native AMP (#3754)
  • Added dist lib to enable syncing anything across devices (#3762)
  • Added broadcast to TPUBackend (#3814)
  • Added XLADeviceUtils class to check XLA device type (#3274)

Changed

  • Refactored accelerator backends:
    • moved TPU xxx_step to backend (#3118)
    • refactored DDP backend forward (#3119)
    • refactored GPU backend __step (#3120)
    • refactored Horovod backend (#3121, #3122)
    • remove obscure forward call in eval + CPU backend ___step (#3123)
    • reduced all simplified forward (#3126)
    • added hook base method (#3127)
    • refactor eval loop to use hooks - use test_mode for if so we can split later (#3129)
    • moved ___step_end hooks (#3130)
    • training forward refactor (#3134)
    • training AMP scaling refactor (#3135)
    • eval step scaling factor (#3136)
    • add eval loop object to streamline eval loop (#3138)
    • refactored dataloader process hook (#3139)
    • refactored inner eval loop (#3141)
    • final inner eval loop hooks (#3154)
    • clean up hooks in run_evaluation (#3156)
    • clean up data reset (#3161)
    • expand eval loop out (#3165)
    • moved hooks around in eval loop (#3195)
    • remove _evaluate fx (#3197)
    • Trainer.fit hook clean up (#3198)
    • DDPs train hooks (#3203)
    • refactor DDP backend (#3204, #3207, #3208, #3209, #3210)
    • reduced accelerator selection (#3211)
    • group prepare data hook (#3212)
    • added data connector (#3285)
    • modular is_overridden (#3290)
    • adding Trainer.tune() (#3293)
    • move run_pretrain_routine -> setup_training (#3294)
    • move train outside of setup training (#3297)
    • move prepare_data to data connector (#3307)
    • moved accelerator router (#3309)
    • train loop refactor - moving train loop to own object (#3310, #3312, #3313, #3314)
    • duplicate data interface definition up into DataHooks class (#3344)
    • inner train loop (#3359, #3361, #3362, #3363, #3365, #3366, #3367, #3368, #3369, #3370, #3371, #3372, #3373, #3374, #3375, #3376, #3385, #3388, #3397)
    • all logging related calls in a connector (#3395)
    • device parser (#3400, #3405)
    • added model connector (#3407)
    • moved eval loop logging to loggers (#3408)
    • moved eval loop (#3412, #3408)
    • trainer/separate argparse (#3421, #3428, #3432)
    • move lr_finder (#3434)
    • organize args (#3435, #3442, #3447, #3448, #3449, #3456)
    • move specific accelerator code (#3457)
    • group connectors (#3472)
    • accelerator connector methods x/n (#3469, #3470, #3474)
    • merge backends (#3476, #3477, #3478, #3480, #3482)
    • apex plugin (#3502)
    • precision plugins (#3504)
    • Result - make monitor default to checkpoint_on to simplify (#3571)
    • reference to the Trainer on the LightningDataModule (#3684)
    • add .log to lightning module (#3686, #3699, #3701, #3704, #3715)
    • enable tracking original metric when step and epoch are both true (#3685)
    • deprecated results obj, added support for simpler comms (#3681)
    • move backends back to individual files (#3712)
    • fixes logging for eval steps (#3763)
    • decoupled DDP, DDP spawn (#3733, #3766, #3767, #3774, #3802, #3806)
    • remove weight loading hack for ddp_cpu (#3808)
    • separate torchelastic from DDP (#3810)
    • separate SLURM from DDP (#3809)
    • decoupled DDP2 (#3816)
    • bug fix with logging val epoch end + monitor (#3812)
    • decoupled DDP, DDP spawn (#3733, #3817, #3819, #3927)
    • callback system and init DDP (#3836)
    • adding compute environments (#3837, #3842)
    • epoch can now log independently (#3843)
    • test selecting the correct backend. temp backends while slurm and TorchElastic are decoupled (#3848)
    • fixed init_slurm_connection causing hostname errors (#3856)
    • moves init apex from LM to apex connector (#3923)
    • moves sync bn to each backend (#3925)
    • moves configure ddp to each backend (#3924)
  • Deprecation warning (#3844)
  • Changed LearningRateLogger to LearningRateMonitor (#3251)
  • Used fsspec instead of gfile for all IO (#3320)
    • Swapped torch.load for fsspec load in DDP spawn backend (#3787)
    • Swapped torch.load for fsspec load in cloud_io loading (#3692)
    • Added support for to_disk() to use remote filepaths with fsspec (#3930)
    • Updated model_checkpoint's to_yaml to use fsspec open (#3801)
    • Fixed fsspec being inconsistent when doing fs.ls (#3805)
  • Refactor GPUStatsMonitor to improve training speed (#3257)
  • Changed IoU score behavior for classes absent in target and pred (#3098)
  • Changed IoU remove_bg bool to ignore_index optional int (#3098)
  • Changed defaults of save_top_k and save_last to None in ModelCheckpoint (#3680)
  • row_log_interval and log_save_interval are now based on training loop's global_step instead of epoch-internal batch index (#3667)
  • Silenced some warnings. verified ddp refactors (#3483)
  • Cleaning up stale logger tests (#3490)
  • Allow ModelCheckpoint monitor to be None (#3633)
  • Enable None model checkpoint default (#3669)
  • Skipped best_model_path if checkpoint_callback is None (#2962)
  • Used raise .. from .. to explicitly chain exceptions (#3750)
  • Mocking loggers (#3596, #3617, #3851, #3859, #3884, #3853, #3910, #3889, #3926)
  • Write predictions in LightningModule instead of EvalResult (#3882)
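
For readers unfamiliar with the `raise .. from ..` form mentioned in the list above, here is a minimal sketch of explicit exception chaining in plain Python. The `get_hparam` function and its error message are hypothetical and not taken from the Lightning codebase.

```python
def get_hparam(hparams, key):
    # Hypothetical lookup used only to demonstrate exception chaining.
    try:
        return hparams[key]
    except KeyError as err:
        # `from err` stores the original exception on __cause__, so the
        # traceback shows both the low-level and the higher-level error.
        raise ValueError(f"missing hyperparameter: {key!r}") from err


try:
    get_hparam({'lr': 0.01}, 'batch_size')
except ValueError as exc:
    chained = exc  # keep a reference so the cause can be inspected
```

Here `chained.__cause__` is the original `KeyError`, which is what makes the chain explicit rather than implicit.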

Deprecated

  • Deprecated TrainResult and EvalResult, use self.log and self.write from the LightningModule to log metrics and write predictions. training_step can now only return a scalar (for the loss) or a dictionary with anything you want. (#3681)
  • Deprecate early_stop_callback Trainer argument (#3845)
  • Rename Trainer arguments row_log_interval >> log_every_n_steps and log_save_interval >> flush_logs_every_n_steps (#3748)
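
As a rough aid for the argument renames listed above, the sketch below translates old keyword names to the new ones before they are passed on. `migrate_trainer_kwargs` is a hypothetical helper, not part of Lightning; only the two name pairs come from the entry above.

```python
# Old -> new Trainer argument names, taken from the deprecation entry above.
RENAMED_TRAINER_ARGS = {
    'row_log_interval': 'log_every_n_steps',
    'log_save_interval': 'flush_logs_every_n_steps',
}


def migrate_trainer_kwargs(kwargs):
    # Hypothetical helper (not a Lightning API): return a copy of `kwargs`
    # with deprecated names replaced by their new equivalents.
    migrated = {}
    for name, value in kwargs.items():
        new_name = RENAMED_TRAINER_ARGS.get(name, name)
        if new_name in migrated:
            # Both the old and the new spelling were supplied.
            raise ValueError(f"argument given twice: {new_name!r}")
        migrated[new_name] = value
    return migrated


old_kwargs = {'max_epochs': 10, 'row_log_interval': 50, 'log_save_interval': 100}
new_kwargs = migrate_trainer_kwargs(old_kwargs)
```

The resulting `new_kwargs` dict uses only the new names and leaves unrelated arguments such as `max_epochs` untouched.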

Removed

  • Removed experimental Metric API (#3868, #3943, #3949, #3946), listed changes before final removal:
    • Added EmbeddingSimilarity metric (#3349, #3358)
    • Added hooks to metric module interface (#2528)
    • Added error when AUROC metric is used for multiclass problems (#3350)
    • Fixed ModelCheckpoint with save_top_k=-1 option not tracking the best models when a monitor metric is available (#3735)
    • Fixed counter-intuitive error being thrown in Accuracy metric for zero target tensor (#3764)
    • Fixed aggregation of metrics (#3517)
    • Fixed Metric aggregation (#3321)
    • Fixed RMSLE metric (#3188)
    • Renamed reduction to class_reduction in classification metrics (#3322)
    • Changed class_reduction similar to sklearn for classification metrics (#3322)
    • Renaming of precision recall metric (#3308)

Fixed

  • Fixed on_train_batch_start hook to end epoch early (#3700)
  • Fixed num_sanity_val_steps so that it is clipped to limit_val_batches (#2917)
  • Fixed ONNX model save on GPU (#3145)
  • Fixed GpuUsageLogger to work on different platforms (#3008)
  • Fixed auto-scale batch size not dumping auto_lr_find parameter (#3151)
  • Fixed batch_outputs with optimizer frequencies (#3229)
  • Fixed setting batch size in LightningModule.datamodule when using auto_scale_batch_size (#3266)
  • Fixed Horovod distributed backend compatibility with native AMP (#3404)
  • Fixed batch size auto scaling exceeding the size of the dataset (#3271)
  • Fixed getting experiment_id from MLFlow only once instead of each training loop (#3394)
  • Fixed overfit_batches which now correctly disables shuffling for the training loader. (#3501)
  • Fixed gradient norm tracking for row_log_interval > 1 (#3489)
  • Fixed ModelCheckpoint name formatting (#3164)
  • Fixed auto-scale batch size (#3151)
  • Fixed example implementation of AutoEncoder (#3190)
  • Fixed invalid paths when remote logging with TensorBoard (#3236)
  • Fixed XLA compatibility by changing t() to transpose(), as XLA devices do not support .t() on 1-dim tensors (#3252)
  • Fixed (weights only) checkpoints loading without PL (#3287)
  • Fixed gather_all_tensors cross GPUs in DDP (#3319)
  • Fixed CometML save dir (#3419)
  • Fixed forward key metrics (#3467)
  • Fixed normalize mode at confusion matrix (replace NaNs with zeros) (#3465)
  • Fixed global step increment in training loop when training_epoch_end hook is used (#3673)
  • Fixed dataloader shuffling not getting turned off with overfit_batches > 0 and distributed_backend = "ddp" (#3534)
  • Fixed determinism in DDPSpawnBackend when using seed_everything in main process (#3335)
  • Fixed ModelCheckpoint period to actually save every period epochs (#3630)
  • Fixed val_progress_bar total with num_sanity_val_steps (#3751)
  • Fixed Tuner dump: add current_epoch to dumped_params (#3261)
  • Fixed current_epoch and global_step properties mismatch between Trainer and LightningModule (#3785)
  • Fixed learning rate scheduler for optimizers with internal state (#3897)
  • Fixed tbptt_reduce_fx when non-floating tensors are logged (#3796)
  • Fixed model checkpoint frequency (#3852)
  • Fixed logging a non-tensor scalar with result breaking subsequent epoch aggregation (#3855)
  • Fixed TrainerEvaluationLoopMixin activates model.train() at the end (#3858)
  • Fixed overfit_batches when using with multiple val/test_dataloaders (#3857)
  • Fixed training_step so that it can return None (#3862)
  • Fixed init nan for checkpointing (#3863)
  • Fixed load_from_checkpoint (#2776)
  • Fixed incorrect batch_sizes when DataLoader returns a dict with multiple tensors (#3668)
  • Fixed unexpected signature for validation_step (#3947)

Contributors

@abrahambotros, @akihironitta, @ananthsub, @ananyahjha93, @awaelchli, @Borda, @c00k1ez, @carmocca, @f4hy, @GimmickNG, @jbschiratti, @justusschock, @LeeJZh, @lezwon, @Lucas-Steinmann, @maxjeblick, @monney, @mpariente, @nateraw, @nrupatunga, @patrickorlando, @PhilJd, @rohitgr7, @s-rog, @ShomyLiu, @SkafteNicki, @Sordie, @teddykoker, @tgaddair, @Vozf, @williamFalcon, @XDynames, @ydcjeff

If we forgot someone due to not matching the commit email with GitHub account, let us know :]