Accelerate + Dynamo broken in 4.46.0 due to model loss functions refactor #34402

Closed · AbrahamSanders opened this issue Oct 25, 2024 · 2 comments · Fixed by #34511

AbrahamSanders commented Oct 25, 2024

System Info

  • transformers version: 4.46.0
  • Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
  • Python version: 3.9.16
  • Huggingface_hub version: 0.24.0
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: NO
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: 0
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: True
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
    - dynamo_config: {'dynamo_backend': 'INDUCTOR'}
  • PyTorch version (GPU?): 2.5.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: Yes
  • GPU type: NVIDIA RTX A6000

Who can help?

@muellerzr @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

#34191 introduced custom loss functions to the model classes. This appears to have broken training with accelerate + torch dynamo.

To reproduce, run run_clm.py with the accelerate config below, using the launch command that follows it:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: INDUCTOR
enable_cpu_affinity: true
gpu_ids: '0'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

accelerate launch run_clm.py \
    --log_level info \
    --model_name_or_path=meta-llama/Llama-3.2-1B \
    --dataset_name=Salesforce/wikitext \
    --dataset_config_name=wikitext-2-raw-v1 \
    --block_size=1024 \
    --per_device_train_batch_size=4 \
    --do_train \
    --bf16 \
    --output_dir=Llama-3.2-1B-wikitext-2-raw-v1 \
    --overwrite_output_dir \
    --seed=42 \
    --logging_steps=10 \
    --lr_scheduler_type=cosine \
    --num_train_epochs=3 \
    --learning_rate=5e-05 \
    --warmup_ratio=0.03 \
    --dataloader_drop_last

This produces an error from dynamo relating to the new model_cls.loss_function attribute added in #34191:

loss = None
if labels is not None:
    loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **loss_kwargs)

Important part of the traceback:

  File "/anaconda3/envs/dev/lib/python3.9/site-packages/torch/_dynamo/variables/functions.py", line 152, in __init__
    assert isinstance(
AssertionError: expected FunctionType found _lru_cache_wrapper <functools._lru_cache_wrapper object at 0x7f1091109a40>

from user code:
   File "/anaconda3/envs/dev/lib/python3.9/site-packages/accelerate/utils/operations.py", line 820, in forward
    return model_forward(*args, **kwargs)
  File "/anaconda3/envs/dev/lib/python3.9/site-packages/accelerate/utils/operations.py", line 808, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/anaconda3/envs/dev/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/anaconda3/envs/dev/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1214, in forward
    loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **loss_kwargs)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
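
For reference, the assertion can be reproduced outside of transformers with a small standalone script. This is only a sketch of the suspected failure mode, assuming the root cause is that the model's loss_function attribute resolves to a functools.lru_cache-wrapped callable, which dynamo on torch 2.5.0 refuses to treat as a plain FunctionType; ToyModel and toy_loss are made-up names for illustration.

import functools

import torch
import torch.nn.functional as F

# lru_cache is used here only to turn the callable into a functools._lru_cache_wrapper;
# caching on tensor arguments is not meaningful and is not the point of the sketch.
@functools.lru_cache(maxsize=None)
def toy_loss(logits, labels):
    return F.cross_entropy(logits, labels)

class ToyModel(torch.nn.Module):
    # the class attribute is an _lru_cache_wrapper, not a FunctionType
    loss_function = staticmethod(toy_loss)

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 4)

    def forward(self, x, labels):
        logits = self.linear(x)
        return self.loss_function(logits, labels)

# mirrors dynamo_backend: INDUCTOR from the accelerate config above
model = torch.compile(ToyModel(), backend="inductor")
x = torch.randn(2, 8)
labels = torch.randint(0, 4, (2,))
model(x, labels)  # expected to trip the same "expected FunctionType found _lru_cache_wrapper" assertion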

If you update the accelerate config so that dynamo is not used, training runs just fine:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
enable_cpu_affinity: true
gpu_ids: '0'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Expected behavior

Training with accelerate + torch dynamo should succeed without this error, as it did before 4.46.0.

Ryukijano added a commit to Ryukijano/transformers that referenced this issue Oct 27, 2024
Fixes huggingface#34402

* Remove the `lru_cache` decorator from the `loss_function` attribute in the `LlamaForCausalLM` class.

* Ensure the `loss_function` is a `FunctionType` in the `forward` method of the `LlamaForCausalLM` class.
* Update the `__init__` method to include parentheses around the `layer_idx` check.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/huggingface/transformers/issues/34402?shareId=XXXX-XXXX-XXXX-XXXX).
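
The commit above points at dropping the lru_cache wrapper so the loss function is a plain function again. A minimal sketch of that direction, assuming the loss lives as a module-level function in a registry keyed by loss type (the names for_causal_lm_loss and LOSS_MAPPING are illustrative, not necessarily the ones used in transformers):

import torch.nn.functional as F

# A plain module-level function: dynamo sees a FunctionType it can inline,
# instead of a functools._lru_cache_wrapper.
def for_causal_lm_loss(logits, labels, vocab_size, **kwargs):
    # shift so that tokens < n predict token n, then flatten for cross entropy
    logits = logits.float()
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, vocab_size),
        shift_labels.view(-1),
        ignore_index=-100,
    )

# hypothetical registry the model could resolve self.loss_function from
LOSS_MAPPING = {"ForCausalLM": for_causal_lm_loss}

With the wrapper gone, the attribute resolves to an ordinary FunctionType and the isinstance assertion in torch/_dynamo/variables/functions.py no longer fires.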
@ArthurZucker
Collaborator

I think removing the @lru_cache is the fix, no? Can you confirm that this resolves your issue? 🤗

@fzyzcjy
Contributor

fzyzcjy commented Oct 31, 2024

Hi, I am seeing the same issue. Looking forward to the fix!
