Invalidate trace cache @ step 0 and module 740: cache has only 0 modules #119
Comments
It's weird: the error message disappeared after I switched to DeepSpeed ZeRO stage 2, but the eval loss is still 0...
I'm having the exact same issue. Can anyone help with that?
Same issue.
Interestingly, the loss starts at 0 but increases as training continues. However, I hit another error during evaluation at epoch 0.35.
Can anyone help with that?
I hit the same error as you at epoch 0.3.
There was a new commit bumping math-verify to 0.3.3. I tried upgrading math-verify, and it seems to solve the "compare inequalities" issue.
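For anyone who wants to confirm that their environment already has the bumped version before re-running, here is a minimal sketch. It assumes the package is installed under the distribution name `math-verify`; the helper names (`parse_release`, `is_at_least`, `check_math_verify`) are made up for illustration and are not part of the repo or of math-verify itself.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Tuple


def parse_release(v: str) -> Tuple[int, ...]:
    """Turn '0.3.3' into (0, 3, 3); stop at the first non-numeric piece."""
    parts = []
    for piece in v.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed: str, required: str) -> bool:
    """Compare two dotted version strings by their numeric components."""
    return parse_release(installed) >= parse_release(required)


def check_math_verify(required: str = "0.3.3") -> str:
    """Report whether the installed math-verify meets the required version.

    Assumption: the distribution name is 'math-verify'.
    """
    try:
        installed = version("math-verify")
    except PackageNotFoundError:
        return "math-verify is not installed"
    if is_at_least(installed, required):
        return f"math-verify {installed} is OK (>= {required})"
    return f"math-verify {installed} is older than {required}; consider upgrading"


if __name__ == "__main__":
    print(check_math_verify())
```

This only compares the numeric release components, so pre-release or local version suffixes are ignored. For strict comparison, a dedicated version parser would be a better fit.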
I'm running the GRPO stage with this command (note that I've changed the model from the 7B Qwen to a 1.5B model):
accelerate launch --config_file configs/zero3.yaml src/open_r1/grpo.py --output_dir Qwen2.5-Math-1.5B-Instruct-GRPO --model_name_or_path Qwen/Qwen2.5-Math-1.5B-Instruct --dataset_name AI-MO/NuminaMath-TIR --max_prompt_length 256 --per_device_train_batch_size 1 --gradient_accumulation_steps 16 --logging_steps 10 --bf16
During training, this error message keeps popping up in the log, and my evaluation loss stays at 0 (which is obviously a problem):
Invalidate trace cache @ step 0 and module 740: cache has only 0 modules
I know that this is an unsolved issue in DeepSpeed. I'm just trying to understand why it is happening in my particular setup and whether others have seen the same error message.