
Multi-worker HF training using Trainer API results in too many graph compilations after saving checkpoint (transformers>=4.35) #813

Closed
jeffhataws opened this issue Jan 10, 2024 · 3 comments

Comments

jeffhataws (Contributor) commented on Jan 10, 2024

(See also huggingface/transformers#28438)

I followed the "PyTorch Neuron for Trainium Hugging Face BERT MRPC task finetuning using Hugging Face Trainer API" tutorial to fine-tune BERT. I ran run_2w.sh and saw the following behavior: training proceeds normally until the first checkpoint is saved, and then a compilation is triggered at every step (I changed the save_steps option in run_2w.sh to 10 in order to trigger the issue faster):

[INFO|trainer.py:1712] 2024-01-09 17:04:08,045 >> ***** Running training *****
[INFO|trainer.py:1713] 2024-01-09 17:04:08,045 >>   Num examples = 1,840
[INFO|trainer.py:1714] 2024-01-09 17:04:08,045 >>   Num Epochs = 5
[INFO|trainer.py:1715] 2024-01-09 17:04:08,045 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1718] 2024-01-09 17:04:08,045 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1719] 2024-01-09 17:04:08,045 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1720] 2024-01-09 17:04:08,045 >>   Total optimization steps = 1,150
[INFO|trainer.py:1721] 2024-01-09 17:04:08,045 >>   Number of trainable parameters = 109,483,778
  0%|          | 0/1150 [00:00<?, ?it/s]
2024-01-09 17:04:08.000173:  140637  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:08.000175:  140637  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_16506334326618155050+abb26765/model.neff. Exiting with a successfully compiled graph.
  0%|          | 1/1150 [00:00<04:53,  3.92it/s]
2024-01-09 17:04:09.000508:  140742  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:09.000603:  140742  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_2044823947559839528+abb26765/model.neff. Exiting with a successfully compiled graph.
  0%|          | 2/1150 [00:02<29:23,  1.54s/it]
2024-01-09 17:04:13.000328:  140780  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:13.000442:  140780  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_7850734058944619683+abb26765/model.neff. Exiting with a successfully compiled graph.
  1%|          | 10/1150 [00:09<08:40,  2.19it/s]
[INFO|trainer.py:2859] 2024-01-09 17:04:17,051 >> Saving model checkpoint to /tmp/mrpc/tmp-checkpoint-10

(Checkpoint saved; from this point on, a compilation is triggered at every step:)
    
2024-01-09 17:04:17.000789:  141260  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:17.000873:  141260  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_2523922307180626946+abb26765/model.neff. Exiting with a successfully compiled graph.
2024-01-09 17:04:20.000215:  141270  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:20.000216:  141270  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_6208462474369064908+abb26765/model.neff. Exiting with a successfully compiled graph.
2024-01-09 17:04:21.000202:  141279  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:21.000282:  141279  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_14983430005009285767+abb26765/model.neff. Exiting with a successfully compiled graph.
2024-01-09 17:04:23.000265:  141288  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:23.000266:  141288  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_3356031905174227108+abb26765/model.neff. Exiting with a successfully compiled graph.
2024-01-09 17:04:24.000025:  141297  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:24.000104:  141297  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_5950234423484734321+abb26765/model.neff. Exiting with a successfully compiled graph.
2024-01-09 17:04:26.000063:  141306  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-09 17:04:26.000064:  141306  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.11.0.35+4f5279863/MODULE_10500036830841255848+abb26765/model.neff. Exiting with a successfully compiled graph.

(The compilation is repeated many times, and the Neuron runtime eventually runs out of device memory.)

This issue starts in transformers version 4.35.
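
As an additional data point, the recompilations can be watched from inside the training loop via torch_xla's metrics report. Below is a rough sketch (not part of the original run; it assumes torch_xla is importable in the training environment):

    # Hedged sketch: a Trainer callback that dumps torch_xla metrics around
    # each checkpoint save. A "Metric: CompileTime" count that keeps growing
    # after the first checkpoint confirms the per-step recompilation above.
    import torch_xla.debug.metrics as met
    from transformers import TrainerCallback

    class CompileWatcher(TrainerCallback):
        def on_step_end(self, args, state, control, **kwargs):
            if state.global_step % args.save_steps == 0:
                print(met.metrics_report())

    # Pass callbacks=[CompileWatcher()] when constructing the Trainer.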

jeffhataws (Contributor, Author) commented:
The fix huggingface/transformers#28669 has been merged. It is expected to be part of HF transformers 4.38.

jeffhataws (Contributor, Author) commented:
The fix is part of the HF transformers patch release 4.37.2.

jeffhataws (Contributor, Author) commented:
To use HF transformers 4.37.2 with Neuron, please add --save_safetensors False to the Trainer API invocation (as a run_glue.py option, for example) and run the following shell command to patch the Trainer API (it removes the step that moves the model to CPU before saving a checkpoint):

    # Workaround https://github.com/aws-neuron/aws-neuron-sdk/issues/813
    sed -i "s/model\.to(\"cpu\")//" `python -c "import site; print(site.getsitepackages()[0])"`/transformers/trainer.py
