
Optimum's DeBERTa-V2 behaves strangely when training with ORT (training hangs or takes impossibly long) #305

Closed
2 of 4 tasks
carzh opened this issue Jul 19, 2022 · 7 comments

@carzh
Contributor

carzh commented Jul 19, 2022

System Info

Running with CUDA 11.5, Python 3.8, and torch 1.11. I installed the Python dependencies from requirements.txt in the text-classification example folder. I installed transformers from source, and tried running both with Optimum installed from source and with Optimum installed via pip; I got the same results in both cases.

Running in an Ubuntu image on a VM with 8 V100 GPUs.

Who can help?

@JingyaHuang @echarlaix

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

After properly setting up the environment, I run the following:

python -m torch.distributed.run --nproc_per_node=8 run_glue.py --model_name_or_path microsoft/deberta-v2-xxlarge --task_name MRPC --do_train --max_seq_length 128 --per_device_train_batch_size 1 --learning_rate 3e-6 --max_steps 8000 --output_dir /tmp/deberta_res --overwrite_output_dir --logging_steps 8000 --fp16 --sharded_ddp simple --num_train_epochs 1

It downloads and tokenizes the dataset, then, once it starts setting up ONNX and reaches the ORTTrainer training call, it hangs for around 7 minutes 40 seconds (give or take 5 seconds) with no terminal output and GPU utilization at 0. After that wait it continues, but training is very slow and the terminal is flooded with logs about the ONNX graph. The output is printed so fast that the messages are hard to read, and no progress bar for training is visible. I let it train for over 4 days, and it still had not finished.

I ran the same arguments with the corresponding run_glue.py example script from the Transformers repository, without the Optimum ORTTrainer, and it finished training within an hour -- it also did not print any terminal output beyond the expected progress bars and warnings.

Finally, I tried modifying the run_glue.py example script from the Transformers repository to use the Optimum ORTTrainer (sketched below), and it again printed a large amount of ONNX graph output, so much that the progress bar, if it was printed at all, was obscured.
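The modification was essentially swapping the stock Trainer for Optimum's ORTTrainer, roughly as in the sketch below. This is a minimal sketch, not the exact diff: the variables (model, training_args, datasets, etc.) come from the existing run_glue.py script, and depending on the optimum version the ORTTrainer constructor may expect additional arguments (for example a task/feature name).

```python
# Minimal sketch of the change to run_glue.py: use optimum's ORTTrainer in
# place of transformers.Trainer. All other variables (model, training_args,
# train_dataset, ...) are the ones already defined in the example script.
from optimum.onnxruntime import ORTTrainer  # instead of `from transformers import Trainer`

trainer = ORTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

train_result = trainer.train()
```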

I did not run into any error messages, just the strange behavior described above: the hang, the flood of logs, and the unreasonably long training time.

Thanks for your time! Please let me know if I set up my environment incorrectly etc.

Expected behavior

Training completes successfully -- I ran the corresponding run_glue.py example script from the Transformers repository with the same arguments and it finished training within the hour.

@carzh carzh added the bug Something isn't working label Jul 19, 2022
@JingyaHuang JingyaHuang self-assigned this Jul 19, 2022
@JingyaHuang
Contributor

I observed that the implementation of DeBERTa in transformers contains some numpy/math operations that lead to an incorrect export. See details here.

As fairscale distributed training (simple) works correctly with ORTTrainer for other models, I suspect that the abnormal training behavior comes from ONNX subgraphs not being traced correctly.

I will open a PR in transformers to correct this, and then check whether it is the root cause of the issue.
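For illustration only (this is hypothetical code, not the actual DeBERTa implementation in transformers), the kind of pattern that breaks the export looks like the first function below: a scale computed with Python's math module runs eagerly, so the tracer bakes the result into the graph as a constant instead of recording the operation. The torch-native variant stays in the traced graph.

```python
import math
import torch

def attention_scores_python_math(q, k):
    # math.sqrt runs eagerly in Python at trace time, so the scale is folded
    # into the exported ONNX graph as a fixed constant rather than a traced op.
    scale = math.sqrt(q.size(-1))
    return torch.matmul(q, k.transpose(-1, -2)) / scale

def attention_scores_torch_ops(q, k):
    # torch.sqrt on a tensor is recorded by the tracer, so the exported graph
    # computes the scale with proper ops and a consistent dtype.
    scale = torch.sqrt(torch.tensor(q.size(-1), dtype=q.dtype, device=q.device))
    return torch.matmul(q, k.transpose(-1, -2)) / scale
```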

@JingyaHuang
Contributor

Hi @carzh,

Some updates on the issue: the problem comes from the implementation of DeBERTa in transformers.

I just tested with transformers after both fixes, and the distributed training now works for fp32, but it fails for fp16 with the following error message:

RuntimeError: /onnxruntime_src/orttraining/orttraining/python/orttraining_pybind_state.cc:713 onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::training::OrtModuleGraphBuilder*, const pybind11::bytes&, const onnxruntime::training::OrtModuleGraphBuilderConfiguration&)> [ONNXRuntimeError] : 1 : FAIL : Type Error: Type parameter (T) of Optype (MatMul) bound to different types (tensor(float) and tensor(float16) in node (MatMul_232).

It seems that the inputs of a MatMul have mismatched dtypes, which is quite similar to the problem we previously ran into when training gpt2. I will continue debugging it this week. For now, I have put all the DeBERTa fixes here.
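For illustration only (hypothetical code, not the actual DeBERTa implementation or fix), this class of error typically shows up when one MatMul operand stays float32 while the activations are float16 under mixed precision; casting the float32 operand to the activation dtype keeps the exported graph consistent.

```python
import torch

def bias_matmul_mismatched(scores_fp16, rel_embed_fp32):
    # Under eager autocast this mix may be handled implicitly, but in the
    # exported graph the MatMul ends up with one float16 and one float32
    # input, which ONNX Runtime rejects when building the training graph.
    return torch.matmul(scores_fp16, rel_embed_fp32)

def bias_matmul_consistent(scores_fp16, rel_embed_fp32):
    # Casting the float32 operand to the activation dtype gives the MatMul
    # two inputs of the same type in the exported graph.
    return torch.matmul(scores_fp16, rel_embed_fp32.to(scores_fp16.dtype))
```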

@JingyaHuang
Contributor

JingyaHuang commented Aug 8, 2022

Hi @carzh, I just opened a PR in transformers to fix this issue.
I tested it on my end, and it enables distributed mixed-precision training with DeBERTa. Could you also test on your side by building transformers from this branch to check whether it solves your issue? Thanks!

@askhade

askhade commented Aug 8, 2022

Thanks @JingyaHuang.
Adding @zhijxu-MS, who can help verify this change.

@zhijxu-MS
Contributor

@JingyaHuang @askhade this branch runs successfully on my side.

@JingyaHuang
Contributor

Awesome, thanks for trying it out @zhijxu-MS @askhade!

@JingyaHuang
Contributor

The fix has been merged into transformers; closing the issue.
