Model Parallelism Training and More Logging Options
Overview
Lightning 1.1 is out! You can now train models with twice the parameters and zero code changes with the new sharded model training! We also have a new plugin for sequential model parallelism, more logging options, and a lot of improvements!
Release highlights: https://bit.ly/3gyLZpP
Learn more about sharded training: https://bit.ly/2W3hgI0
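As a rough illustration of where the memory savings from sharding can come from, here is a back-of-the-envelope sketch. It is not Lightning code. It assumes plain fp32 training with Adam (4 bytes each for the weight, the gradient, and the two Adam moments) and assumes only the optimizer state is partitioned across GPUs. The function name `per_gpu_bytes` is hypothetical, and real savings depend on the model, precision, and activation memory.

```python
def per_gpu_bytes(num_params: int, world_size: int, shard_optimizer_state: bool) -> int:
    """Estimate per-GPU bytes for weights, gradients and Adam state (illustrative only)."""
    weights = 4 * num_params      # assumption: fp32 weights
    grads = 4 * num_params        # assumption: fp32 gradients
    adam_state = 8 * num_params   # assumption: two fp32 moments per parameter
    if shard_optimizer_state:
        # Each GPU keeps only its slice of the optimizer state (ceiling division).
        adam_state = -(-adam_state // world_size)
    return weights + grads + adam_state


baseline = per_gpu_bytes(1_000_000, world_size=8, shard_optimizer_state=False)
sharded = per_gpu_bytes(1_000_000, world_size=8, shard_optimizer_state=True)
print(baseline, sharded)
```

Under these assumptions the per-GPU footprint of the training state drops from 16 to 9 bytes per parameter on 8 GPUs, which leaves room for a larger model.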
Detailed changes
Added
- Added "monitor" key to saved `ModelCheckpoints` (#4383)
- Added `ConfusionMatrix` class interface (#4348)
- Added multiclass AUROC metric (#4236)
- Added global step indexing to the checkpoint name for a better sub-epoch checkpointing experience (#3807)
- Added optimizer hooks in callbacks (#4379)
- Added option to log momentum (#4384)
- Added `current_score` to `ModelCheckpoint.on_save_checkpoint` (#4721)
- Added logging using `self.log` in train and evaluation for epoch end hooks (#4913)
- Added ability for DDP plugin to modify optimizer state saving (#4675)
- Added casting to python types for NumPy scalars when logging `hparams` (#4647)
- Added `prefix` argument in loggers (#4557)
- Added printing of total num of params, trainable and non-trainable params in ModelSummary (#4521)
- Added `PrecisionRecallCurve, ROC, AveragePrecision` class metric (#4549)
- Added custom `Apex` and `NativeAMP` as `Precision plugins` (#4355)
- Added `DALI MNIST` example (#3721)
- Added `sharded plugin` for DDP for multi-GPU training memory optimizations (#4773)
- Added `experiment_id` to the NeptuneLogger (#3462)
- Added `Pytorch Geometric` integration example with Lightning (#4568)
- Added `all_gather` method to `LightningModule` which allows gradient-based tensor synchronizations for use-cases such as negative sampling. (#5012)
- Enabled `self.log` in most functions (#4969)
- Added changeable extension variable for `ModelCheckpoint` (#4977)
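To make the checkpoint naming items above more concrete (global step in the name, changeable extension), here is a minimal sketch of how a name template could be expanded. It is an illustration only, not Lightning's implementation. The helper `format_checkpoint_name` and its parameters are hypothetical.

```python
import re


def format_checkpoint_name(template: str, metrics: dict, extension: str = ".ckpt") -> str:
    """Expand '{key}' placeholders into 'key=value' pairs (illustrative only)."""
    # Rewrite "{epoch}" as "epoch={epoch}" so the key name shows up in the file name.
    expanded = re.sub(r"\{(\w+)", r"\1={\1", template)
    return expanded.format(**metrics) + extension


# Including the global step distinguishes several checkpoints saved within one epoch.
name = format_checkpoint_name("{epoch}-{step}", {"epoch": 2, "step": 1499})
print(name)
```

Because the step is part of the name, two checkpoints written during the same epoch no longer collide.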
Changed
- Removed `multiclass_roc` and `multiclass_precision_recall_curve`, use `roc` and `precision_recall_curve` instead (#4549)
- Tuner algorithms will be skipped if `fast_dev_run=True` (#3903)
- WandbLogger does not force wandb `reinit` arg to True anymore and creates a run only when needed (#4648)
- Changed `automatic_optimization` to be a model attribute (#4602)
- Changed `Simple Profiler` report to order by percentage time spent + num calls (#4880)
- Simplify optimization Logic (#4984)
- Classification metrics overhaul (#4837)
- Updated `fast_dev_run` to accept integer representing num_batches (#4629)
- Refactored optimizer (#4658)
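The `fast_dev_run` change above can be pictured with the following sketch of how a bool-or-int flag might be resolved into a batch count. This is an assumption about the semantics (`True` meaning a single batch, an integer meaning that many batches), not the Trainer's actual code. The helper `num_dev_batches` is hypothetical.

```python
def num_dev_batches(fast_dev_run) -> int:
    """Resolve a bool-or-int fast_dev_run flag into a number of batches (illustrative only)."""
    # bool is a subclass of int in Python, so it has to be checked first.
    if isinstance(fast_dev_run, bool):
        return 1 if fast_dev_run else 0
    if isinstance(fast_dev_run, int) and fast_dev_run >= 0:
        return fast_dev_run
    raise ValueError(f"fast_dev_run must be a bool or a non-negative int, got {fast_dev_run!r}")


print(num_dev_batches(True), num_dev_batches(7))
```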
Deprecated
- Deprecated `prefix` argument in `ModelCheckpoint` (#4765)
- Deprecated the old way of assigning hyper-parameters through `self.hparams = ...` (#4813)
- Deprecated `mode='auto'` from `ModelCheckpoint` and `EarlyStopping` (#4695)
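With `mode='auto'` deprecated, the direction of improvement for a monitored metric has to be stated explicitly. The sketch below shows what an explicit `min`/`max` comparison amounts to. It is an illustration only, and the helper `is_improvement` is hypothetical rather than part of Lightning.

```python
def is_improvement(current: float, best: float, mode: str) -> bool:
    """Compare a monitored value against the best so far for an explicit mode (illustrative only)."""
    if mode == "min":   # e.g. a loss: lower is better
        return current < best
    if mode == "max":   # e.g. an accuracy: higher is better
        return current > best
    raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")


print(is_improvement(0.2, 0.3, "min"), is_improvement(0.2, 0.3, "max"))
```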
Removed
- Removed `reorder` parameter of the `auc` metric (#5004)
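Without a `reorder` option, any sorting of the x-values presumably has to happen before the area is computed. The following pure-Python sketch shows a trapezoidal area-under-curve that expects sorted x-values, with the sorting done explicitly by the caller. It is an illustration under that assumption, not the Lightning metric. The helper `trapezoid_auc` is hypothetical.

```python
def trapezoid_auc(x, y) -> float:
    """Trapezoidal area under the curve; x must already be sorted ascending (illustrative only)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, with at least two points")
    if any(b < a for a, b in zip(x, x[1:])):
        raise ValueError("x must be sorted in ascending order")
    # Sum the area of the trapezoid between each pair of neighbouring points.
    return sum((x1 - x0) * (y0 + y1) / 2 for x0, x1, y0, y1 in zip(x, x[1:], y, y[1:]))


# Unsorted input: sort the (x, y) pairs explicitly before calling.
xs, ys = zip(*sorted(zip([1.0, 0.0, 0.5], [1.0, 0.0, 0.5])))
print(trapezoid_auc(xs, ys))
```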
Fixed
- Added feature to move tensors to CPU before saving (#4309)
- Fixed `LoggerConnector` to have logged metrics on root device in DP (#4138)
- Auto convert tensors to contiguous format when `gather_all` (#4907)
- Fixed `PYTHONPATH` for DDP test model (#4528)
- Fixed allowing logger to support indexing (#4595)
- Fixed DDP and manual_optimization (#4976)
Contributors
@ananyahjha93, @awaelchli, @blatr, @Borda, @borisdayma, @carmocca, @ddrevicky, @george-gca, @gianscarpe, @irustandi, @janhenriklambrechts, @jeremyjordan, @justusschock, @lezwon, @rohitgr7, @s-rog, @SeanNaren, @SkafteNicki, @tadejsv, @tchaton, @williamFalcon, @zippeurfou
If we forgot someone due to not matching commit email with GitHub account, let us know :]