Add a callback to write huggingface checkpoints during the training run #594

dakinggg · 2023-09-12T06:39:11Z

This adds a callback that performs the Hugging Face checkpoint conversion during the training job. Converting in-job avoids a separate, error-prone conversion step after the job is complete, and takes advantage of background uploads while training is still running.

add precision
manual mpt test (7b-mpt-hf-ckpt-4-jc3uSG)
manual llama2 test (l7b-hf-ckpt-4-HpZUAg)

mvpatel2000

Should this be in Composer?

mvpatel2000

Will defer to Evan on this one

llmfoundry/callbacks/hf_checkpointer.py

tests/test_hf_conversion_script.py

eracah

Nice, looks pretty good. Definitely doable to make a SaveForInferenceCallback in the future, I think. Also some good nuggets here for refactoring CheckpointSaver.

llmfoundry/callbacks/hf_checkpointer.py

eracah · 2023-09-12T23:36:39Z

> Should this be in Composer?

At some point yes

tests/test_hf_conversion_script.py

mvpatel2000

Can we add a flag to only save at the end? I imagine the most common use case is that I just want to convert to HF once I'm done training.

dakinggg · 2023-09-14T16:36:00Z

@mvpatel2000 can't you just specify 1dur as the save interval?
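To illustrate why a `1dur` save interval amounts to "only save at the end", here is a minimal sketch of interval-based saving. It is a toy model, not Composer's scheduling code: the `should_save` helper, its argument names, and the assumption that a save always fires at the end of training are hypothetical and chosen only for illustration.

```python
# Toy model of interval-based saving; NOT Composer's actual implementation.
from fractions import Fraction


def should_save(batch: int, max_batches: int, interval: str) -> bool:
    """Return True if a checkpoint would be written after `batch` batches.

    `interval` is either '<n>ba' (every n batches) or '<x>dur' (every
    fraction x of the whole run), loosely mimicking Composer-style time strings.
    """
    if interval.endswith('dur'):
        frac = Fraction(interval[:-3])
        # Convert the fraction of the run into a whole number of batches.
        every = max(1, int(frac * max_batches))
    elif interval.endswith('ba'):
        every = int(interval[:-2])
    else:
        raise ValueError(f'unsupported interval: {interval}')
    # Save on every multiple of the interval, and (by assumption) at the end.
    return batch % every == 0 or batch == max_batches


# With '1dur', the interval spans the full run, so the only save is the last batch.
saves = [b for b in range(1, 101) if should_save(b, 100, '1dur')]
print(saves)  # [100]
```

A shorter interval such as `25ba` would, under the same assumptions, trigger saves at batches 25, 50, 75, and 100.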

mvpatel2000 · 2023-09-14T16:36:48Z

> @mvpatel2000 can't you just specify 1dur as the save interval?

🤯 ur so big brained

eracah

Aight LGTM

germanjke · 2023-10-03T13:28:00Z

@eracah @dakinggg hi guys! Does this support non-MPT models, like LLaMA?

dakinggg · 2023-10-03T16:58:39Z

yes

dakinggg added 20 commits September 7, 2023 23:51

wip

87b2ae8

wip

90185bb

wip

7b9dc24

wip

f9a694b

add init

6ac9aa7

precommit

20309ec

precommit

b332dd2

undo changes to yaml

a911359

add some prints

fd03ca6

tests pass

144af09

add asserts

ad7731d

fix typos

98f72bf

Merge branch 'main' into hf-checkpointer

d2ce2a6

clean up test and fix asserts

7e3ced0

add precision

99a8f97

precommit

a698df3

fix typo

6f0b0f9

precommit

0ac8286

fix end of training bug

ebc2ccd

precommit

975d476

dakinggg marked this pull request as ready for review September 12, 2023 08:37

dakinggg requested review from irenedea, eracah and mvpatel2000 September 12, 2023 08:37

mvpatel2000 reviewed Sep 12, 2023

mvpatel2000 self-requested a review September 12, 2023 16:38

mvpatel2000 reviewed Sep 12, 2023

llmfoundry/callbacks/hf_checkpointer.py (outdated, resolved)

add comment

efc48c3

irenedea reviewed Sep 12, 2023

tests/test_hf_conversion_script.py (outdated, resolved)

eracah reviewed Sep 12, 2023

llmfoundry/callbacks/hf_checkpointer.py (resolved)

llmfoundry/callbacks/hf_checkpointer.py (resolved)

llmfoundry/callbacks/hf_checkpointer.py (resolved)

irenedea reviewed Sep 13, 2023

tests/test_hf_conversion_script.py (resolved)

dakinggg added 7 commits September 12, 2023 20:45

Merge branch 'main' into hf-checkpointer

e15e85e

switch to log

2622ccf

format

688135e

clean up test

da62ca1

precommit

8462587

fix

9df8ad2

fix

e5720d2

dakinggg requested a review from eracah September 13, 2023 06:00

mvpatel2000 reviewed Sep 14, 2023

Merge branch 'main' into hf-checkpointer

b5fb59f

eracah approved these changes Sep 14, 2023

Merge branch 'main' into hf-checkpointer

0519b39

dakinggg enabled auto-merge (squash) September 14, 2023 18:38

dakinggg merged commit 30544f0 into mosaicml:main Sep 14, 2023
8 checks passed

dakinggg deleted the hf-checkpointer branch October 11, 2023 23:05
