Optimum's DeBERTa-V2 behaves strangely when training with ORT (training hangs or takes impossibly long) #305
Comments
I observe that in the implementation of DeBERTa in transformers, there are some numpy/math operations that lead to an incorrect export. See details here. As fairscale distributed (simple) works correctly with ORTTrainer for other models, I suspect that the abnormal training behavior comes from ONNX subgraphs not being traced correctly. I will open a PR in transformers to correct this, and then check whether this is the root of the issue.
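For context, a minimal sketch of the kind of pattern that breaks ONNX tracing (illustrative only, with hypothetical function names, not the actual DeBERTa code in transformers): numpy/math operations run eagerly at trace time, so their result is baked into the exported graph as a constant instead of being traced as ops.

```python
import numpy as np
import torch

# Hypothetical illustration: mixing numpy with torch tensors forces eager
# evaluation, so torch.onnx.export records the result as a constant
# instead of tracing the operation into the graph.
def relative_sign_numpy(relative_pos: torch.Tensor) -> torch.Tensor:
    sign = np.sign(relative_pos.cpu().numpy())  # leaves the traced graph
    return torch.as_tensor(sign, dtype=relative_pos.dtype, device=relative_pos.device)

# Torch-only equivalent: stays inside the traced graph and exports correctly.
def relative_sign_torch(relative_pos: torch.Tensor) -> torch.Tensor:
    return torch.sign(relative_pos)
```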
Hi @carzh, some updates on the issue: the problem comes from the implementation of DeBERTa in transformers:
I just tested with transformers after both fixes, and the distributed training now works for fp32 but fails for fp16 with the following error message:
It seems that the inputs of a
Hi @carzh, I just opened a PR in transformers to fix this issue.
Thanks @JingyaHuang
@JingyaHuang @askhade this branch runs on my side.
Awesome, thanks for trying it out @zhijxu-MS @askhade!
The fix has been merged into transformers, closing the issue.
System Info
Running with CUDA 11.5, Python 3.8, and torch 1.11. I installed the Python dependencies from requirements.txt in the text-classification example folder. I installed transformers from source, and tried running with Optimum both from source and installed via pip; I got the same results in both cases.
Running in an Ubuntu image on a VM with 8 V100 GPUs.
Who can help?
@JingyaHuang @echarlaix
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
After properly setting up the environment, I run the following:
It downloads and tokenizes the dataset; then, once it begins setting up ONNX and reaches the ORTTrainer training call, it hangs for around 7 minutes 40 seconds (give or take 5 seconds) with no terminal output and GPU utilization at 0. After that wait it continues, but trains very slowly and prints a lot of log output about the ONNX graph. The output scrolls so fast that the messages are hard to read, and no training-progress status bar is visible. I let it train for over 4 days, and it still hadn't finished.
I ran the same arguments with the corresponding run_glue.py example script from the Transformers repository, without adding the Optimum ORTTrainer, and it finished training within an hour. It also did not print any terminal output beyond the expected status bars and warnings.
Finally, I tried modifying the run_glue.py example script from the Transformers repository to add the Optimum ORTTrainer, and it printed so much terminal output with ONNX graph information that the status bar, if it was printed at all, was obscured.
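For reference, a minimal sketch of that kind of modification (the model and dataset names below are placeholders rather than the ones actually used here, and ORTTrainer's constructor arguments may vary between optimum versions):

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

# Placeholder model/dataset: the point is only the Trainer -> ORTTrainer swap.
model_name = "microsoft/deberta-v2-xlarge"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("glue", "sst2")
encoded = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, max_length=128),
    batched=True,
)

trainer = ORTTrainer(
    model=model,
    args=ORTTrainingArguments(
        output_dir="out",
        per_device_train_batch_size=8,
        num_train_epochs=1,
    ),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```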
I did not run into any error messages, just strange behavior with the training hanging, the logs, and the unnaturally long training time.
Thanks for your time! Please let me know if I set up my environment incorrectly etc.
Expected behavior
Trains successfully -- I ran the corresponding run_glue.py example script from the Transformers repository with the same arguments and it finished training within an hour.