
Simplify getting latest checkpoint in examples/albert #425

Closed
mryab wants to merge 2 commits into master from simplify_example

Conversation

mryab (Member) commented Dec 15, 2021

From Discord:

Based on huggingface/transformers#10334, I think it's safe for us to rely on transformers for getting the last checkpoint from output_dir and get rid of that logic in our code. I've drafted #425 and will test it later this week.
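
In other words, roughly the following (a rough sketch, not the exact diff; trainer and training_args stand for the Trainer and TrainingArguments already built in run_trainer.py):

    from transformers.trainer_utils import get_last_checkpoint

    # Before (roughly equivalent to the current example code): resolve the
    # newest checkpoint-* directory ourselves and pass it explicitly
    latest_checkpoint = get_last_checkpoint(training_args.output_dir)  # path or None
    trainer.train(resume_from_checkpoint=latest_checkpoint)

    # After: rely on the Trainer (huggingface/transformers#10334);
    # resume_from_checkpoint=True makes it call get_last_checkpoint(output_dir) internally
    trainer.train(resume_from_checkpoint=True)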

codecov bot commented Dec 15, 2021

Codecov Report

Merging #425 (b829c7f) into master (595b831) will decrease coverage by 0.13%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #425      +/-   ##
==========================================
- Coverage   84.06%   83.92%   -0.14%     
==========================================
  Files          77       77              
  Lines        7924     7924              
==========================================
- Hits         6661     6650      -11     
- Misses       1263     1274      +11     
Impacted Files                        Coverage Δ
hivemind/averaging/matchmaking.py     85.11% <0.00%> (-2.39%) ⬇️
hivemind/averaging/averager.py        87.65% <0.00%> (-0.73%) ⬇️

mryab (Member, Author) commented Dec 27, 2021

Tested it with a checkpoint present in the directory:

Before:

Dec 27 15:12:14.244 [INFO] Checkpoint dir outputs, contents [PosixPath('outputs/checkpoint-125000'), PosixPath('outputs/checkpoint-124500')]
Dec 27 15:12:14.244 [INFO] Loading model from outputs/checkpoint-125000

After (https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py#L1096):

Dec 27 15:13:20.843 [INFO] max_steps is given, it will override any value given in num_train_epochs
Dec 27 15:13:20.843 [INFO] Using amp fp16 backend
Dec 27 15:13:20.844 [INFO] Loading model from outputs/checkpoint-125000).

mryab (Member, Author) commented Dec 27, 2021

Unfortunately, without any checkpoint in the directory, it gives

    raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")
ValueError: No valid checkpoint found in output directory (outputs)

So I think I'll close this for now: if we remove our code, it won't be possible to launch the same script both with and without an existing checkpoint and expect it to work.
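
For the record, this is the kind of guard that has to stay (a rough sketch, not the exact example code; trainer and training_args again stand for the objects built in run_trainer.py) so the same command works both on a fresh start and on resume:

    import os

    from transformers.trainer_utils import get_last_checkpoint

    # Fall back to None when output_dir is missing or holds no checkpoint-* dirs,
    # so a fresh run starts from scratch instead of raising
    # "No valid checkpoint found in output directory (outputs)"
    output_dir = training_args.output_dir  # "outputs" in the runs above
    latest = get_last_checkpoint(output_dir) if os.path.isdir(output_dir) else None
    trainer.train(resume_from_checkpoint=latest)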

@mryab mryab closed this Dec 27, 2021
@mryab mryab deleted the simplify_example branch December 27, 2021 12:20