I followed the "PyTorch Neuron for Trainium Hugging Face BERT MRPC task finetuning using Hugging Face Trainer API" tutorial to fine-tune BERT. I ran run_2w.sh and saw the following behavior: training proceeds normally until the first checkpoint is saved, but after that a graph compilation is triggered at every step (I changed the save_steps option in run_2w.sh to 10 steps to trigger the issue faster):
[INFO|trainer.py:1712] 2024-01-09 17:04:08,045 >> ***** Running training *****
[INFO|trainer.py:1713] 2024-01-09 17:04:08,045 >> Num examples = 1,840
[INFO|trainer.py:1714] 2024-01-09 17:04:08,045 >> Num Epochs = 5
[INFO|trainer.py:1715] 2024-01-09 17:04:08,045 >> Instantaneous batch size per device = 8
[INFO|trainer.py:1718] 2024-01-09 17:04:08,045 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1719] 2024-01-09 17:04:08,045 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1720] 2024-01-09 17:04:08,045 >> Total optimization steps = 1,150
[INFO|trainer.py:1721] 2024-01-09 17:04:08,045 >> Number of trainable parameters = 109,483,778
0%| | 0/1150 [00:00<?, ?it/s]2024-01-09 17:04:08.000173: 140637 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:08.000175: 140637 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_16506334326618155050+abb26765/model.neff. Exiting with a successfully compiled graph.
0%| | 1/1150 [00:00<04:53, 3.92it/s]2024-01-09 17:04:09.000508: 140742 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:09.000603: 140742 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_2044823947559839528+abb26765/model.neff. Exiting with a successfully compiled graph.
0%| | 2/1150 [00:02<29:23, 1.54s/it]2024-01-09 17:04:13.000328: 140780 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:13.000442: 140780 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_7850734058944619683+abb26765/model.neff. Exiting with a successfully compiled graph.
1%| | 10/1150 [00:09<08:40, 2.19it/s][INFO|trainer.py:2859] 2024-01-09 17:04:17,051 >> Saving model checkpoint to /tmp/mrpc/tmp-checkpoint-10
(The checkpoint is saved successfully, but from this point on a compilation happens at every step, as shown below)
2024-01-09 17:04:17.000789: 141260 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:17.000873: 141260 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_2523922307180626946+abb26765/model.neff. Exiting with a successfully compiled graph.
2024-01-09 17:04:20.000215: 141270 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:20.000216: 141270 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_6208462474369064908+abb26765/model.neff. Exiting with a successfully compiled graph.
2024-01-09 17:04:21.000202: 141279 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:21.000282: 141279 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_14983430005009285767+abb26765/model.neff. Exiting with a successfully compiled graph.
2024-01-09 17:04:23.000265: 141288 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:23.000266: 141288 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_3356031905174227108+abb26765/model.neff. Exiting with a successfully compiled graph.
2024-01-09 17:04:24.000025: 141297 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:24.000104: 141297 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_5950234423484734321+abb26765/model.neff. Exiting with a successfully compiled graph.
2024-01-09 17:04:26.000063: 141306 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:26.000064: 141306 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_10500036830841255848+abb26765/model.neff. Exiting with a successfully compiled graph.
(The compilation repeats many times, and the Neuron runtime eventually runs out of device memory)
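For reference, here is a minimal single-process sketch of a Trainer setup that matches the logged hyperparameters above (per-device batch size 8, 5 epochs, checkpoints every 10 steps, ~109.48M trainable parameters, which is consistent with a bert-base-uncased classifier). This is not the actual run_2w.sh / run_glue.py invocation, and the two-worker launch is omitted; the model name, max sequence length, and output directory are assumptions for illustration only:

```python
# Minimal sketch (NOT the actual run_2w.sh / run_glue.py invocation) of a
# Trainer configured like the logged run: per-device batch size 8, 5 epochs,
# save_steps lowered to 10 so the first checkpoint (and the recompilation
# behavior that follows it) shows up quickly.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # assumed; consistent with 109,483,778 trainable parameters
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# MRPC train split, tokenized as sentence pairs (max_length=128 is an assumption)
raw = load_dataset("glue", "mrpc")
train_dataset = raw["train"].map(
    lambda ex: tokenizer(
        ex["sentence1"], ex["sentence2"],
        truncation=True, padding="max_length", max_length=128,
    ),
    batched=True,
)

args = TrainingArguments(
    output_dir="/tmp/mrpc",
    per_device_train_batch_size=8,  # matches "Instantaneous batch size per device = 8"
    num_train_epochs=5,             # matches "Num Epochs = 5"
    save_steps=10,                  # lowered from the script default to hit the first checkpoint sooner
)

Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer).train()
```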
This issue starts in transformers version 4.35.
jeffhataws changed the title to "Multi-worker HF training using trainer API result in too many graph compilations after saving checkpoint (transformers>=4.35)" on Jan 10, 2024.
To use HF transformers 4.37.2 with Neuron, please add --save_safetensors False to the Trainer API invocation (for example, as a run_glue.py option), and also run the following shell command to patch the Trainer API (it removes the step that moves the model to CPU before saving a checkpoint):
(See also huggingface/transformers#28438)
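For completeness, the same --save_safetensors False workaround can be expressed directly through the Trainer API instead of the run_glue.py command line, since save_safetensors is a standard TrainingArguments field. In the sketch below, only save_safetensors=False is the workaround itself; the other argument values are illustrative placeholders:

```python
# Sketch of the --save_safetensors False workaround when building
# TrainingArguments in code rather than via run_glue.py's command line.
# Only save_safetensors=False is the workaround; the other values are
# placeholders matching the run logged above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/tmp/mrpc",
    per_device_train_batch_size=8,
    num_train_epochs=5,
    save_steps=10,
    save_safetensors=False,  # fall back to the non-safetensors checkpoint path
)
```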