
Fixing model checkpoints to be robust to multi -> single GPU usage #1091

Closed
wants to merge 7 commits

Conversation

@pruksmhc (Contributor) commented May 17, 2020

This is a fix to #1087. I made the change in the model-loading code rather than in model saving: the change suggested in #1087 would fix multi -> single GPU model loading, but it would break multi -> multi GPU loading (reloading a checkpoint that was trained on multi-GPU on another multi-GPU machine).
I also did some light cleanup of model loading in the trainer to remove redundancy, and deleted an unused parameter.
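For context, here is a minimal standalone illustration of why the mismatch arises (plain PyTorch, not code from this PR): torch.nn.DataParallel stores the wrapped model as a submodule named "module", so every key in its state_dict gains a "module." prefix.

import torch

model = torch.nn.Linear(4, 2)
print(list(model.state_dict().keys()))   # ['weight', 'bias']

dp = torch.nn.DataParallel(model)
print(list(dp.state_dict().keys()))      # ['module.weight', 'module.bias']

# A checkpoint saved from dp therefore cannot be loaded directly into an
# unwrapped (single-GPU) model; the "module." prefix must be stripped,
# which this PR does at load time.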

Tests
Multi -> single GPU: I trained a roberta-large model on SST on multi-GPU, then loaded that checkpoint on a single-GPU machine for further training.
Multi -> multi GPU: jiant already exercises this path implicitly: we load the best checkpoint before evaluation, so it was tested when I first trained the roberta-large SST model on multi-GPU.

@pep8speaks commented May 17, 2020

Hello @pruksmhc! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 541:55: W291 trailing whitespace
Line 548:12: W291 trailing whitespace

Line 329:77: W291 trailing whitespace

You can repair most issues by installing black and running: black -l 100 ./*. If you contribute often, have a look at the 'Contributing' section of the README for instructions on doing this automatically.

Comment last updated at 2020-05-18 20:46:53 UTC

@pruksmhc pruksmhc changed the title Fixing model checkpoints to be robust to multi -> single GPU usage [WIP] Fixing model checkpoints to be robust to multi -> single GPU usage May 17, 2020
@pruksmhc pruksmhc changed the title [WIP] Fixing model checkpoints to be robust to multi -> single GPU usage Fixing model checkpoints to be robust to multi -> single GPU usage May 17, 2020
Comment on lines 345 to 346
if "module" in key:
key = key.replace("module.", "")
Collaborator:

Make this check explicitly for the prefix with .startswith, and drop the first n characters (in case "module" appears somewhere else in the parameter name).
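Concretely, the suggestion amounts to something like this (a sketch, not the PR's final code):

key = "some_module.weight"            # "module" also occurs mid-name
print(key.replace("module.", ""))     # 'some_weight' -- corrupted by the in/replace version

prefix = "module."
if key.startswith(prefix):            # prefix check leaves mid-name matches untouched
    key = key[len(prefix):]
print(key)                            # 'some_module.weight'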


for name, weights in model_state.items():
    key = get_key(name)
    final_model_state[key] = model_state[name]
Collaborator:

Use = weights here: the loop already unpacks the value via items(), so re-indexing with model_state[name] is redundant.
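A self-contained toy version of the suggested loop (model_state and get_key here are illustrative stand-ins for the PR's variables):

import collections

model_state = {"module.encoder.weight": 0}  # toy stand-in for a real state_dict

def get_key(name):
    return name[len("module."):] if name.startswith("module.") else name

final_model_state = collections.OrderedDict()
for name, weights in model_state.items():
    key = get_key(name)
    final_model_state[key] = weights  # use the unpacked value directly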

- log.error("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
- self._model.load_state_dict(model_state, strict=False)
+ load_model_state(self._model, model_path)
Collaborator:

Is this warning being disabled?

Contributor (Author):

No, it's not. The warning is also emitted inside the load_model_state function.

@pyeres (Contributor) commented May 19, 2020

Hi @zphang & @HaokunLiu: are either of you available to provide a substantive review of this PR? The core concerns seem to be 1) whether it addresses issue #1087, and 2) whether these changes introduce new risks or regressions.

"""
final_model_state = collections.OrderedDict()

def get_key(name):
Contributor:

Add a comment explaining why we need this logic.
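For example, the comment could record why the prefix handling exists (a sketch; the actual wording and code in the PR may differ):

def get_key(name):
    # Checkpoints saved from a torch.nn.DataParallel-wrapped model prefix
    # every parameter name with "module."; strip that prefix so the same
    # checkpoint also loads into an unwrapped (single-GPU) model.
    if name.startswith("module."):
        return name[len("module."):]
    return name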

log.error("Parameter missing from checkpoint: " + name)
log.error("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")

self._model.load_state_dict(model_state, strict=False)
Contributor:

You're no longer setting strict=False here. It's debatable whether that's the ideal behavior, but it was intentional, and I believe it has had some real experimental uses. Why the change?
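For reference, strict=False makes load_state_dict tolerate (and report) missing or unexpected keys, whereas the default strict=True raises a RuntimeError. A minimal standalone example (plain PyTorch, not PR code):

import torch

model = torch.nn.Linear(4, 2)
state = {"weight": torch.zeros(2, 4)}      # "bias" deliberately omitted

result = model.load_state_dict(state, strict=False)
print(result.missing_keys)                 # ['bias']

# model.load_state_dict(state)             # strict=True (default): RuntimeError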

@jeswan (Collaborator) commented Sep 17, 2020

Are these changes still necessary? We're planning to close all open PRs and move jiant2 into this repo in the near future.

@jeswan jeswan added the jiant-v1-legacy Relevant to versions <= v1.3.2 label Sep 17, 2020
@jeswan jeswan closed this Sep 22, 2020
@jeswan jeswan deleted the fix_multi_to_single branch September 22, 2020 03:46