How did you prepare the manifest dir for pretrain and in which format? #1705
Hi, `mean` is used to simulate multi-GPU training, and `sum` is used to simulate a large batch size. I used 8 GPUs with a max duration of 600. It seems that maintaining the same `accum_grad * max_duration * world_size` doesn't match the original setup.
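One way to read the `mean` vs `sum` distinction is with a toy example. This is only a sketch on a made-up one-parameter squared-error model, not the icefall code: `grad_sum`, `grad_mean`, and the data are hypothetical. With a sum-reduced loss, adding up the gradients of the accumulated micro-batches reproduces the gradient of one large batch; with a mean-reduced loss, averaging them reproduces the kind of averaging done across workers in data-parallel training (exactly so only when the micro-batches are the same size).

```python
# Toy model (hypothetical): prediction = w * x, loss = squared error.

def grad_sum(w, batch):
    """Gradient w.r.t. w of the sum-reduced loss sum((w*x - y)**2)."""
    return sum(2.0 * x * (w * x - y) for x, y in batch)

def grad_mean(w, batch):
    """Gradient w.r.t. w of the mean-reduced loss over the batch."""
    return grad_sum(w, batch) / len(batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
micro_batches = [data[:2], data[2:]]  # two accumulation steps of equal size

# Sum reduction: accumulated gradients equal the large-batch gradient.
accumulated_sum = sum(grad_sum(w, mb) for mb in micro_batches)
sum_matches_large_batch = abs(accumulated_sum - grad_sum(w, data)) < 1e-9

# Mean reduction: averaging per-micro-batch gradients mimics averaging
# over workers (exact here because the micro-batches are equal-sized).
averaged_mean = sum(grad_mean(w, mb) for mb in micro_batches) / len(micro_batches)
mean_matches_worker_average = abs(averaged_mean - grad_mean(w, data)) < 1e-9

print(sum_matches_large_batch, mean_matches_worker_average)
```

In real training the batches produced by a duration-based sampler are usually not equal-sized, so the mean case only holds approximately.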
I tried running your zipformer/ codes, but my model diverged at epoch 33 and pretraining ended with a `Grad scale is small` error. Throughout pretraining before the divergence, I noticed my `grad scale` tended to fluctuate between 0.125 and 2. Did you face the same issues?

EDIT: I was also wondering if you tried toggling the loss reduction to `mean` instead of `sum`. Maybe that will stabilise training?

My commands. I adapted the batch size to my setup, maintaining the same `accum_grad * max_duration * world_size`.

    # pretraining
    python zipformer/pretrain.py \
      --world-size 4 \
      --use-fp16 1 \
      --num-epochs 50 \
      --manifest-dir data/raw \
      --max-duration 350 \
      --accum-grad 2 \
      --exp-dir zipformer/exp2/pretrain

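For reference, the product in question can be checked with a few lines of arithmetic. This is a rough sketch: `effective_duration` is a hypothetical helper, and the `accum_grad` value for the 8-GPU, max-duration-600 setup mentioned in the reply is not stated anywhere in this thread, so the value 1 below is purely an assumption.

```python
def effective_duration(world_size, max_duration, accum_grad):
    """Seconds of audio contributing to one optimizer step (upper bound)."""
    return world_size * max_duration * accum_grad

# Values taken from the command above.
mine = effective_duration(world_size=4, max_duration=350, accum_grad=2)

# 8 GPUs and max duration 600 come from the reply in this thread;
# accum_grad=1 is an ASSUMPTION, since the reply does not state it.
reference = effective_duration(world_size=8, max_duration=600, accum_grad=1)

print(mine, reference, mine / reference)
```

Under that assumption the two products differ (2800 vs 4800 seconds per optimizer step), which would be consistent with the remark that the adapted setup does not match the original one.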
As per your explanation, I used the same 500 k-means labels from `simple_kmeans`.

Originally posted by @teowenshen in #1500 (comment)