Multihost training collapses from time to time when loading the next batch #786
Comments
My current workaround is to catch the FileNotFoundError and recreate the data_iterator. To get back to the batch you were on, you can then run load_next_batch multiple times, but that's a very bad, error-prone, slow solution.
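A minimal sketch of this retry-and-replay workaround; `create_data_iterator` and the `config`/`mesh` arguments are hypothetical stand-ins for the corresponding MaxText input-pipeline helpers, not the exact code from this thread:

```python
def load_next_batch_with_retry(data_iterator, step, config, mesh):
    """Fetch the next batch, rebuilding the iterator on shared-memory failures."""
    try:
        return next(data_iterator), data_iterator
    except FileNotFoundError:
        # The shared-memory file backing the prefetched batch has vanished:
        # rebuild the iterator and replay it from the beginning up to `step`.
        data_iterator = create_data_iterator(config, mesh)  # hypothetical factory
        for _ in range(step):
            next(data_iterator)  # slow: every earlier batch is re-read and discarded
        return next(data_iterator), data_iterator
```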
That's a nice workaround :) thanks a lot! I made a small modification so that you don't have to replay all the data from the beginning every time you recreate the data iterator. If we checkpoint the data iterator using Grain, your function can be made quite efficient with the following modification; this code currently works for me :)
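A sketch of that modification, assuming the iterator exposes Grain's `get_state()` / `set_state()` checkpointing methods; `create_data_iterator` is again a hypothetical factory:

```python
def load_next_batch_with_restore(data_iterator, config, mesh):
    """Fetch the next batch, restoring the iterator's position on failure."""
    try:
        return next(data_iterator), data_iterator
    except FileNotFoundError:
        # Capture the iterator's logical position, rebuild it, and jump straight
        # back to that position instead of replaying every batch from step 0.
        state = data_iterator.get_state()  # assumes a Grain-style iterator
        data_iterator = create_data_iterator(config, mesh)  # hypothetical factory
        data_iterator.set_state(state)
        return next(data_iterator), data_iterator
```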
PS: it just occurs to me while writing this reply that maybe a better workaround is to keep a copy of the data iterator's state from the previous step inside the train loop. Though I haven't tested this code yet...
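An untested sketch of that idea: snapshot the state after each successful step so a failure can be rolled back to the last good position (same hypothetical helper names and Grain-style state API as above):

```python
last_good_state = data_iterator.get_state()
for step in range(start_step, num_steps):  # hypothetical loop bounds
    try:
        batch = next(data_iterator)
        last_good_state = data_iterator.get_state()  # snapshot after success
    except FileNotFoundError:
        # Roll back: rebuild the iterator and restore the last good position.
        data_iterator = create_data_iterator(config, mesh)  # hypothetical factory
        data_iterator.set_state(last_good_state)
        batch = next(data_iterator)
    # ... run the training step on `batch` ...
```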
Hi,
I was testing multi-host training on a v4-16 TPU VM. The training normally runs smoothly, but sometimes it collapses at `load_next_batch` with a FileNotFoundError raised from process 0. The command for running the job is:

```
python3 MaxText/train.py MaxText/configs/gpt2.yml run_name=gpt2 base_output_directory=gs://maxtext_multihost_job steps=120000 dataset_type=hf hf_path=YUE-FAN/openwebtext_gcp hf_data_dir=data tokenizer_path=EleutherAI/gpt-neox-20b eval_interval=4000 hf_eval_split=validation enable_checkpointing=True eval_batch_num=558 per_device_batch_size=32 eval_per_device_batch_size=32 checkpoint_period=10000 logits_via_embedding=True normalize_embedding_logits=True
```

I have very limited knowledge of Python multiprocessing, but it seems to be a problem related to reading shared memory? This problem does not always occur, but it happens from time to time. Any assistance here would be appreciated! Thanks!