How did you prepare the manifest dir for pretrain and in which format? #1705

Closed
sanjuktasr opened this issue Jul 26, 2024 · 1 comment

@sanjuktasr

Hi there @yfyeung, first of all, thank you for creating this SSL recipe!

I tried running your zipformer/ code, but my model diverged at epoch 33 and pretraining ended with a "Grad scale is small" error.

Throughout pretraining before the divergence, I noticed my grad scale tended to fluctuate between 0.125 and 2.

Did you face the same issues?

EDIT: I was also wondering if you tried toggling the loss reduction to mean instead of sum. Maybe that will stabilise training?

My commands are below. I adapted the batch size to my setup, keeping the product accum_grad * max_duration * world_size the same.

# pretraining
python zipformer/pretrain.py \
    --world-size 4 \
    --use-fp16 1 \
    --num-epochs 50 \
    --manifest-dir data/raw \
    --max-duration 350 \
    --accum-grad 2 \
    --exp-dir zipformer/exp2/pretrain

As per your explanation, I used the same 500 k-means labels from simple_kmeans.

Originally posted by @teowenshen in #1500 (comment)
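
On the question of mean versus sum reduction, here is a minimal sketch of how the two choices can differ under gradient accumulation. It uses plain numbers in place of per-utterance losses, and the function name `accumulate` and its arguments are hypothetical, not taken from the recipe. Whether zipformer/pretrain.py scales the loss exactly this way is an assumption.

```python
# Sketch only: plain floats stand in for per-utterance losses.
# `accumulate` and its parameters are illustrative names, not recipe code.

def accumulate(per_sample_losses, accum_grad, reduction):
    """Accumulate a scalar loss over `accum_grad` equal-sized micro-batches."""
    size = len(per_sample_losses) // accum_grad
    total = 0.0
    for i in range(accum_grad):
        micro = per_sample_losses[i * size:(i + 1) * size]
        micro_loss = sum(micro)
        if reduction == "mean":
            # Average over micro-batches, similar to how gradients are
            # averaged across workers in data-parallel training.
            micro_loss /= accum_grad
        total += micro_loss
    return total


losses = [1.0, 2.0, 3.0, 4.0]
sum_total = accumulate(losses, accum_grad=2, reduction="sum")
mean_total = accumulate(losses, accum_grad=2, reduction="mean")
print(sum_total, mean_total)
```

With "sum", the accumulated value equals what a single batch holding all four samples would give. With "mean", it is divided by the number of accumulation steps, so the gradient magnitude (and the pressure on the fp16 grad scale) differs between the two settings.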

@yfyeung
Collaborator

yfyeung commented Jul 26, 2024

Hi, the mean reduction is used to simulate multi-GPU training, and the sum reduction is used to simulate a large batch size. I used 8 GPUs with a max duration of 600. It seems that your accum_grad * max_duration * world_size does not match the original setup.
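
The mismatch can be checked with a quick calculation. This is a sketch, assuming max_duration is measured in seconds of audio per GPU per batch; the helper name `audio_seconds_per_step` is made up for illustration. The accum_grad value for the 8-GPU run is not stated in this thread, so 1 is used only as a placeholder that gives the smallest possible value.

```python
# Sketch only: `audio_seconds_per_step` is a hypothetical helper.
# Assumes max_duration is seconds of audio per GPU per batch.

def audio_seconds_per_step(world_size, max_duration, accum_grad):
    """Total audio (in seconds) contributing to one optimizer step."""
    return world_size * max_duration * accum_grad


# Setup from the command in this issue.
mine = audio_seconds_per_step(world_size=4, max_duration=350, accum_grad=2)

# 8 GPUs with max duration 600; accum_grad=1 is a placeholder because
# the actual value is not given here, so this is a lower bound.
reference_min = audio_seconds_per_step(world_size=8, max_duration=600, accum_grad=1)

print(mine, reference_min)
```

Even with the placeholder accum_grad of 1, the 8-GPU configuration covers more audio per optimizer step than the 4-GPU command above, so the two products are not equal.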
