delete loaded ckpt after use to save memory
Summary:
Pull Request resolved: facebookresearch#574

Currently, the d2go runner doesn't delete the checkpoint after loading it. This is fine when running with `resume=True`, because all of the model/optimizer/EMA state in the checkpoint is loaded into the corresponding training components. With `resume=False`, however, only the model state is loaded, and the optimizer/EMA state stays in memory until the end of training. This can cause OOM if the checkpoint is large.

This diff deletes the loaded checkpoint after use to save memory and avoid potential OOM issues.

Reviewed By: tglik

Differential Revision: D46674618

fbshipit-source-id: 2b70a8e46c7f2a309f83cc4deefe5d7a14783734
Anthony Chen authored and facebook-github-bot committed Jun 13, 2023
1 parent a879c1b commit 3fce52c
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions d2go/runner/default_runner.py
@@ -568,6 +568,7 @@ def do_train(self, cfg, model, resume):
             if resume and checkpointer.has_checkpoint()
             else -1
         )
+        del checkpoint
         # The checkpoint stores the training iteration that just finished, thus we start
         # at the next iteration (or iter zero if there's no checkpoint).
         start_iter += 1
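
The effect of the added `del checkpoint` can be illustrated with a minimal, standalone sketch. It is not d2go code: `Checkpoint`, `load_checkpoint`, and `start_iteration` are hypothetical stand-ins, and the stored values are made up. The sketch assumes the local variable is the only remaining reference to the loaded checkpoint, in which case dropping it lets Python reclaim the unused optimizer state right away rather than at the end of training.

```python
import gc
import weakref


class Checkpoint(dict):
    """Hypothetical stand-in for a loaded checkpoint.

    A dict subclass is used because plain dicts cannot be weak-referenced.
    """


def load_checkpoint():
    # Hypothetical loader; a real one would read tensors from disk.
    return Checkpoint(
        model={"w": [0.0] * 1000},
        optimizer={"momentum": [0.0] * 1000},
        iteration=41,
    )


def start_iteration(resume):
    checkpoint = load_checkpoint()
    probe = weakref.ref(checkpoint)  # lets us observe when the object is freed

    # Same shape as the logic in the diff: only resume uses the stored iteration.
    start_iter = checkpoint.get("iteration", -1) if resume else -1

    # Drop the last reference so the unused state can be reclaimed now.
    del checkpoint
    gc.collect()  # not needed on CPython, but harmless on other interpreters

    freed = probe() is None
    return start_iter + 1, freed
```

Calling `start_iteration(False)` returns iteration zero and reports that the checkpoint object is gone. Without the `del`, the object would stay alive for as long as the enclosing function keeps running, which in the runner is the whole training loop.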