
enh: enable loading model weights from training checkpoint #3969

Merged
geoffreyangus merged 5 commits into master from load_weights_from_checkpoint on Mar 20, 2024

Conversation

geoffreyangus
Contributor

Title says it all. This should enable users to load models from a PyTorch training checkpoint instead of only from finalized models. Useful if a job errors midway through training.
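A minimal usage sketch of what this enables (the paths are illustrative, and it assumes the flag is exposed on `LudwigModel.load` as in the docstring excerpt quoted further down this thread):

```python
from ludwig.api import LudwigModel

# Default behavior (unchanged): load the finalized weights saved at the end of training.
# NOTE: the model directory path here is an illustrative assumption.
model = LudwigModel.load("results/api_experiment_run/model")

# New behavior: recover the latest checkpoint from training_checkpoints/ instead,
# e.g. when a job errored out before the final model weights were written.
model = LudwigModel.load("results/api_experiment_run/model", from_checkpoint=True)
```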

Contributor

@Infernaught left a comment

LGTM, but why do we set this to False by default? Is there a reason why this shouldn't be the default behavior?


github-actions bot commented Mar 18, 2024

Unit Test Results

6 files ±0   6 suites ±0   duration: 14m 24s ⏱️ (-2m 39s)
12 tests ±0: 9 passed ✔️ ±0, 3 skipped 💤 ±0, 0 failed ±0
60 runs ±0: 42 passed ✔️ ±0, 18 skipped 💤 ±0, 0 failed ±0

Results for commit d242b5e. Comparison against base commit c09d5dc.

♻️ This comment has been updated with latest results.

@geoffreyangus
Contributor Author

@Infernaught it's an interesting point. We want it False by default just because that preserves the existing user behavior. We can consider changing that if it's an overall better experience, but for now we don't want to introduce confusion.

Comment on lines +1796 to +1798
:param from_checkpoint: (bool, default: `False`) if `True`, the model
will be loaded from the latest checkpoint (training_checkpoints/)
instead of the final model weights.
Contributor

Yeah, I guess this is a fair thing to do. I'm wondering, though, what the reason would be to load from_checkpoint if model/model_weights is already present? Perhaps it could be a no-op in that case and we could always make this True? Okay with keeping it like this for now as well, just wanted to call it out.

Contributor Author

Yeah, I think it's okay to have it be explicit and not make assumptions on behalf of the user for now.

We can revisit this if people get confused, but because it defaults to False, I'm hopeful this won't interrupt anyone's experience (until, of course, someone really needs it, at which point we can direct them to this flag).
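To make the scenario concrete, here's a rough sketch of the two artifacts being discussed (the directory names follow the docstring's `training_checkpoints/` and the `model/model_weights` path mentioned above; the surrounding paths are assumptions):

```python
import os

# Hypothetical output directory from a training run (path is illustrative).
model_dir = "results/api_experiment_run/model"

# Finalized weights saved at the end of training; per the discussion above,
# these may be missing if the job errored midway through training.
has_final_weights = os.path.exists(os.path.join(model_dir, "model_weights"))

# Intermediate checkpoints written during the course of training.
has_checkpoints = os.path.isdir(os.path.join(model_dir, "training_checkpoints"))

# from_checkpoint=True is mainly useful when a run died before the final weights
# were written, i.e. has_final_weights is False but has_checkpoints is True.
```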

Collaborator

@alexsherstinsky left a comment

@geoffreyangus ✅ -- code LGTM (and it was illuminating -- thanks!). I was just wondering: how did/do we test something like this, where the call is distributed? Thank you!

@geoffreyangus
Contributor Author

I added a test for this @alexsherstinsky, if you want to take a look!

@geoffreyangus merged commit 25e4ac1 into master on Mar 20, 2024
18 checks passed
@geoffreyangus deleted the load_weights_from_checkpoint branch on March 20, 2024 00:15
@@ -32,6 +32,79 @@
)


def test_model_load_from_checkpoint(tmpdir, csv_filename, tmp_path):
Collaborator

❤️ @geoffreyangus Thank you -- so cool! For my edification, which checkpoint are we comparing? It looks like the loaded model from storage (ludwig_model_2) is the result of training ludwig_model_1 all the way through the 1 epoch -- as opposed to some intermediate checkpoint (e.g., saved after a few steps). Is this correct? Thank you!

Contributor Author

Yeah, that is correct. It's just the latest checkpoint deposited into training_checkpoints/ during the course of training. In the 1-epoch case, it is equivalent to the model weights at the end of training.

For some reason on my local machine, the models weren't equivalent after 2 epochs. My hunch is that this is because there was a difference between the "best" checkpoint (loaded at the end of training) and the "latest" checkpoint.
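For anyone following along, a rough sketch of the comparison described above (not the PR's actual test; the config, output paths, and the `from_checkpoint` flag on `LudwigModel.load` are assumptions for illustration):

```python
import os

import torch
from ludwig.api import LudwigModel


def test_model_load_from_checkpoint_sketch(tmpdir, csv_filename):
    # Illustrative config; the real test derives its features from the csv_filename fixture.
    config = {
        "input_features": [{"name": "text", "type": "text"}],
        "output_features": [{"name": "label", "type": "category"}],
        "trainer": {"epochs": 1},
    }

    # Train ludwig_model_1 for a single epoch.
    ludwig_model_1 = LudwigModel(config)
    ludwig_model_1.train(dataset=csv_filename, output_directory=str(tmpdir))

    # Load ludwig_model_2 from the latest entry in training_checkpoints/
    # instead of the finalized model weights (output path is an assumption).
    model_dir = os.path.join(str(tmpdir), "api_experiment_run", "model")
    ludwig_model_2 = LudwigModel.load(model_dir, from_checkpoint=True)

    # With one epoch, the latest checkpoint should match the final weights exactly.
    state_1 = ludwig_model_1.model.state_dict()
    state_2 = ludwig_model_2.model.state_dict()
    assert state_1.keys() == state_2.keys()
    for key in state_1:
        assert torch.equal(state_1[key], state_2[key])
```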

Collaborator

🙇
