Training more than one epoch #914
Hi @peregilk, the new behavior is documented here: https://github.com/AI-Hypercomputer/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#huggingface-pipeline-in-multihost
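For anyone landing here later: the gist of that doc section is that in multihost training the HuggingFace pipeline streams the dataset and splits its shards across hosts, so one pass on a single host only covers a fraction of the data. Below is a minimal sketch of that idea using the `datasets` library directly; the dataset name, host count, and loop are my own illustrative assumptions, not MaxText's actual implementation.

```python
# Minimal sketch of per-host sharding with a streaming HuggingFace dataset.
# Illustrative only: dataset name, rank/world_size values, and the loop body
# are assumptions, not MaxText's code.
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

host_index = 0   # this host's index, 0..num_hosts-1 (assumed)
num_hosts = 4    # total number of hosts (assumed)

stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
# Each host iterates a disjoint subset of the dataset's shards, so with
# N hosts a single pass on one host covers roughly 1/N of the data.
host_stream = split_dataset_by_node(stream, rank=host_index, world_size=num_hosts)

for example in host_stream.take(3):  # .take() keeps the sketch cheap to run
    print(example.keys())
```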
@aireenmei Thanks a lot for the explanation. I thought the drop in weights and loss here had hurt the model, and was wondering why it did not show up in my evaluations. Now it makes total sense. Thanks.
@aireenmei A couple of minor follow-ups, attached to this thread since they are related. I followed the instructions on the page above and ran into two minor issues:
Thanks for reporting. Yes, setting eval_steps is recommended; it's no longer for debugging only. I'll update that.
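For reference, a hedged sketch of how these settings can be passed as command-line overrides when launching training. Only `eval_steps` is confirmed above; the `eval_interval` flag name, the run name, and the numeric values are my assumptions.

```python
# Hedged sketch of a MaxText launch with eval settings overridden.
# Flag names follow my reading of the base config; run_name and the
# numeric values are placeholders.
import subprocess

cmd = [
    "python3", "MaxText/train.py", "MaxText/configs/base.yml",
    "run_name=eval-steps-demo",  # placeholder
    "eval_interval=1000",        # run an eval every 1000 train steps (assumed)
    "eval_steps=50",             # cap each eval at 50 batches, not the whole split
]
subprocess.run(cmd, check=True)
```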
@aireenmei Referring you here because I think this issue is touched on in #571, where you write:
The behaviour now seems to have changed a bit, and it might even be more confusing. I am a bit uncertain what has changed in the code here.
What I am trying to do is switch datasets during training, here from step 160k. This is a fairly small, special-task dataset, and I am studying its effect. The dataset has 256 shards, and one epoch is roughly 350 steps.
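For concreteness, the epoch length works out like this. Only the 256 shards and the ~350-step estimate come from the thread; the per-shard example count and global batch size below are assumptions chosen so the arithmetic lands there.

```python
# Back-of-the-envelope check of "256 shards, one epoch ≈ 350 steps".
num_shards = 256                # from the thread
examples_per_shard = 700        # assumed
global_batch_size = 512         # assumed: per_device_batch_size * device count

steps_per_epoch = num_shards * examples_per_shard // global_batch_size
print(steps_per_epoch)          # -> 350 with these assumed numbers
```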
Here is what is happening, with my comments: