I followed the "PyTorch Neuron for Trainium Hugging Face BERT MRPC task finetuning using Hugging Face Trainer API" tutorial to fine-tune BERT. I ran run_2w.sh and saw the following behavior: training proceeds normally until the first checkpoint is saved, but after that a graph compilation is triggered at every step (I changed the save_steps option in run_2w.sh to 10 steps to trigger the issue faster):
[INFO|trainer.py:1712] 2024-01-09 17:04:08,045 >> ***** Running training *****
[INFO|trainer.py:1713] 2024-01-09 17:04:08,045 >> Num examples = 1,840
[INFO|trainer.py:1714] 2024-01-09 17:04:08,045 >> Num Epochs = 5
[INFO|trainer.py:1715] 2024-01-09 17:04:08,045 >> Instantaneous batch size per device = 8
[INFO|trainer.py:1718] 2024-01-09 17:04:08,045 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1719] 2024-01-09 17:04:08,045 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1720] 2024-01-09 17:04:08,045 >> Total optimization steps = 1,150
[INFO|trainer.py:1721] 2024-01-09 17:04:08,045 >> Number of trainable parameters = 109,483,778
0%| | 0/1150 [00:00<?, ?it/s]2024-01-09 17:04:08.000173: 140637 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:08.000175: 140637 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_16506334326618155050+abb26765/model.neff. Exiting with a successfully compiled graph.
0%| | 1/1150 [00:00<04:53, 3.92it/s]2024-01-09 17:04:09.000508: 140742 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:09.000603: 140742 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_2044823947559839528+abb26765/model.neff. Exiting with a successfully compiled graph.
0%| | 2/1150 [00:02<29:23, 1.54s/it]2024-01-09 17:04:13.000328: 140780 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:13.000442: 140780 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_7850734058944619683+abb26765/model.neff. Exiting with a successfully compiled graph.
1%| | 10/1150 [00:09<08:40, 2.19it/s][INFO|trainer.py:2859] 2024-01-09 17:04:17,051 >> Saving model checkpoint to /tmp/mrpc/tmp-checkpoint-10
(The checkpoint is saved successfully, but from this point on a compilation happens at every step, as shown below)
2024-01-09 17:04:17.000789: 141260 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:17.000873: 141260 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_2523922307180626946+abb26765/model.neff. Exiting with a successfully compiled graph.
2024-01-09 17:04:20.000215: 141270 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:20.000216: 141270 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_6208462474369064908+abb26765/model.neff. Exiting with a successfully compiled graph.
2024-01-09 17:04:21.000202: 141279 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:21.000282: 141279 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_14983430005009285767+abb26765/model.neff. Exiting with a successfully compiled graph.
2024-01-09 17:04:23.000265: 141288 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:23.000266: 141288 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_3356031905174227108+abb26765/model.neff. Exiting with a successfully compiled graph.
2024-01-09 17:04:24.000025: 141297 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:24.000104: 141297 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_5950234423484734321+abb26765/model.neff. Exiting with a successfully compiled graph.
2024-01-09 17:04:26.000063: 141306 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:26.000064: 141306 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_10500036830841255848+abb26765/model.neff. Exiting with a successfully compiled graph.
(The compilation repeats many times, and the Neuron runtime eventually runs out of device memory)
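For reference, here is a minimal single-process sketch of a Trainer setup that matches the logged hyperparameters above (per-device batch size 8, 5 epochs, checkpoints every 10 steps, ~109.48M trainable parameters, which is consistent with a bert-base-uncased classifier). This is not the actual run_2w.sh / run_glue.py invocation, and the two-worker launch is omitted; the model name, max sequence length, and output directory are assumptions for illustration only:

```python
# Minimal sketch (NOT the actual run_2w.sh / run_glue.py invocation) of a
# Trainer configured like the logged run: per-device batch size 8, 5 epochs,
# save_steps lowered to 10 so the first checkpoint (and the recompilation
# behavior that follows it) shows up quickly.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # assumed; consistent with 109,483,778 trainable parameters
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# MRPC train split, tokenized as sentence pairs (max_length=128 is an assumption)
raw = load_dataset("glue", "mrpc")
train_dataset = raw["train"].map(
    lambda ex: tokenizer(
        ex["sentence1"], ex["sentence2"],
        truncation=True, padding="max_length", max_length=128,
    ),
    batched=True,
)

args = TrainingArguments(
    output_dir="/tmp/mrpc",
    per_device_train_batch_size=8,  # matches "Instantaneous batch size per device = 8"
    num_train_epochs=5,             # matches "Num Epochs = 5"
    save_steps=10,                  # lowered from the script default to hit the first checkpoint sooner
)

Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer).train()
```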
This issue starts in transformers version 4.35.
jeffhataws changed the title to "Multi-worker HF training using trainer API result in too many graph compilations after saving checkpoint (transformers>=4.35)" on Jan 10, 2024.
To use HF transformers 4.37.2 with Neuron, please add --save_safetensors False to the Trainer API invocation (for example, as a run_glue.py option), and also run the following shell command to patch the Trainer API (it removes the step that moves the model to CPU before saving a checkpoint):
(See also huggingface/transformers#28438)
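For completeness, the same --save_safetensors False workaround can be expressed directly through the Trainer API instead of the run_glue.py command line, since save_safetensors is a standard TrainingArguments field. In the sketch below, only save_safetensors=False is the workaround itself; the other argument values are illustrative placeholders:

```python
# Sketch of the --save_safetensors False workaround when building
# TrainingArguments in code rather than via run_glue.py's command line.
# Only save_safetensors=False is the workaround; the other values are
# placeholders matching the run logged above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/tmp/mrpc",
    per_device_train_batch_size=8,
    num_train_epochs=5,
    save_steps=10,
    save_safetensors=False,  # fall back to the non-safetensors checkpoint path
)
```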