
RuntimeError: CUDA error: an illegal memory access was encountered [When Running SFT on Qwen2.5] #1991

Closed · Fixed by #2013
Malikeh97 opened this issue Oct 23, 2024 · 11 comments
Labels: bug (Something isn't working), under review

Malikeh97 commented Oct 23, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Hi,
We ran the YAML config below to train Qwen2.5-14B-Instruct via supervised fine-tuning (SFT), following the guidelines in the Axolotl repo. We ran the code on a "Deep Learning AMI", specifically Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.1 (Ubuntu 22.04) 20241016.

Current behaviour

However, training fails with RuntimeError: CUDA error: an illegal memory access was encountered. I should add that we get the same error with a LoRA config and with the Llama 3 8B model.

Steps to reproduce

  1. Follow the Quick Start installation guidelines in the Axolotl README.
  2. Run the YAML config with the following command: nohup accelerate launch -m axolotl.cli.train /home/ubuntu/qwen2.5_14B.yml > training_output.log 2>&1 &

Config yaml

base_model: Qwen/Qwen2.5-14B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: /path/to/my/dataset.jsonl
    type: sharegpt
    conversation: chatml

chat_template: chatml
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./outputs/out

sequence_len: 8192
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true

unfrozen_parameters:
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# input_layernorm layers
- model.layers.0.input_layernorm
- model.layers.1.input_layernorm
- model.layers.2.input_layernorm
- model.layers.3.input_layernorm
- model.layers.4.input_layernorm
- model.layers.5.input_layernorm
- model.layers.6.input_layernorm
- model.layers.7.input_layernorm
- model.layers.8.input_layernorm
- model.layers.9.input_layernorm
- model.layers.10.input_layernorm
- model.layers.11.input_layernorm
- model.layers.12.input_layernorm
- model.layers.13.input_layernorm
- model.layers.14.input_layernorm
- model.layers.15.input_layernorm
- model.layers.16.input_layernorm
- model.layers.17.input_layernorm
- model.layers.18.input_layernorm
- model.layers.19.input_layernorm
- model.layers.20.input_layernorm
- model.layers.21.input_layernorm
- model.layers.22.input_layernorm
- model.layers.23.input_layernorm
# lm_head layers
# mlp.down_proj layers
- model.layers.1.mlp.down_proj
- model.layers.35.mlp.down_proj
- model.layers.38.mlp.down_proj
- model.layers.37.mlp.down_proj
- model.layers.36.mlp.down_proj
- model.layers.15.mlp.down_proj
- model.layers.11.mlp.down_proj
- model.layers.12.mlp.down_proj
- model.layers.34.mlp.down_proj
- model.layers.44.mlp.down_proj
- model.layers.45.mlp.down_proj
- model.layers.9.mlp.down_proj
- model.layers.41.mlp.down_proj
- model.layers.33.mlp.down_proj
- model.layers.43.mlp.down_proj
- model.layers.40.mlp.down_proj
- model.layers.13.mlp.down_proj
- model.layers.8.mlp.down_proj
- model.layers.39.mlp.down_proj
- model.layers.10.mlp.down_proj
- model.layers.14.mlp.down_proj
- model.layers.16.mlp.down_proj
- model.layers.31.mlp.down_proj
- model.layers.32.mlp.down_proj
# mlp.gate_proj layers
- model.layers.1.mlp.gate_proj
- model.layers.44.mlp.gate_proj
- model.layers.46.mlp.gate_proj
- model.layers.45.mlp.gate_proj
- model.layers.43.mlp.gate_proj
- model.layers.47.mlp.gate_proj
- model.layers.42.mlp.gate_proj
- model.layers.32.mlp.gate_proj
- model.layers.27.mlp.gate_proj
- model.layers.33.mlp.gate_proj
- model.layers.28.mlp.gate_proj
- model.layers.39.mlp.gate_proj
- model.layers.41.mlp.gate_proj
- model.layers.40.mlp.gate_proj
- model.layers.30.mlp.gate_proj
- model.layers.29.mlp.gate_proj
- model.layers.31.mlp.gate_proj
- model.layers.26.mlp.gate_proj
- model.layers.37.mlp.gate_proj
- model.layers.10.mlp.gate_proj
- model.layers.38.mlp.gate_proj
- model.layers.12.mlp.gate_proj
- model.layers.36.mlp.gate_proj
- model.layers.13.mlp.gate_proj
# mlp.up_proj layers
- model.layers.1.mlp.up_proj
- model.layers.13.mlp.up_proj
- model.layers.11.mlp.up_proj
- model.layers.14.mlp.up_proj
- model.layers.15.mlp.up_proj
- model.layers.12.mlp.up_proj
- model.layers.8.mlp.up_proj
- model.layers.16.mlp.up_proj
- model.layers.9.mlp.up_proj
- model.layers.19.mlp.up_proj
- model.layers.10.mlp.up_proj
- model.layers.7.mlp.up_proj
- model.layers.17.mlp.up_proj
- model.layers.20.mlp.up_proj
- model.layers.21.mlp.up_proj
- model.layers.18.mlp.up_proj
- model.layers.38.mlp.up_proj
- model.layers.37.mlp.up_proj
- model.layers.39.mlp.up_proj
- model.layers.42.mlp.up_proj
- model.layers.41.mlp.up_proj
- model.layers.27.mlp.up_proj
- model.layers.28.mlp.up_proj
- model.layers.34.mlp.up_proj
# model.norm layers
# post_attention_layernorm layers
- model.layers.0.post_attention_layernorm
- model.layers.1.post_attention_layernorm
- model.layers.2.post_attention_layernorm
- model.layers.3.post_attention_layernorm
- model.layers.4.post_attention_layernorm
- model.layers.5.post_attention_layernorm
- model.layers.6.post_attention_layernorm
- model.layers.7.post_attention_layernorm
- model.layers.8.post_attention_layernorm
- model.layers.9.post_attention_layernorm
- model.layers.10.post_attention_layernorm
- model.layers.11.post_attention_layernorm
- model.layers.12.post_attention_layernorm
- model.layers.13.post_attention_layernorm
- model.layers.14.post_attention_layernorm
- model.layers.15.post_attention_layernorm
- model.layers.16.post_attention_layernorm
- model.layers.17.post_attention_layernorm
- model.layers.18.post_attention_layernorm
- model.layers.19.post_attention_layernorm
- model.layers.20.post_attention_layernorm
- model.layers.21.post_attention_layernorm
- model.layers.22.post_attention_layernorm
- model.layers.23.post_attention_layernorm
# self_attn.k_proj layers
- model.layers.47.self_attn.k_proj
- model.layers.39.self_attn.k_proj
- model.layers.41.self_attn.k_proj
- model.layers.37.self_attn.k_proj
- model.layers.35.self_attn.k_proj
- model.layers.44.self_attn.k_proj
- model.layers.38.self_attn.k_proj
- model.layers.14.self_attn.k_proj
- model.layers.7.self_attn.k_proj
- model.layers.12.self_attn.k_proj
- model.layers.11.self_attn.k_proj
- model.layers.32.self_attn.k_proj
- model.layers.10.self_attn.k_proj
- model.layers.8.self_attn.k_proj
- model.layers.9.self_attn.k_proj
- model.layers.6.self_attn.k_proj
- model.layers.45.self_attn.k_proj
- model.layers.42.self_attn.k_proj
- model.layers.5.self_attn.k_proj
- model.layers.40.self_attn.k_proj
- model.layers.33.self_attn.k_proj
- model.layers.0.self_attn.k_proj
- model.layers.34.self_attn.k_proj
- model.layers.13.self_attn.k_proj
# self_attn.o_proj layers
- model.layers.12.self_attn.o_proj
- model.layers.5.self_attn.o_proj
- model.layers.14.self_attn.o_proj
- model.layers.16.self_attn.o_proj
- model.layers.20.self_attn.o_proj
- model.layers.13.self_attn.o_proj
- model.layers.11.self_attn.o_proj
- model.layers.4.self_attn.o_proj
- model.layers.6.self_attn.o_proj
- model.layers.19.self_attn.o_proj
- model.layers.7.self_attn.o_proj
- model.layers.18.self_attn.o_proj
- model.layers.8.self_attn.o_proj
- model.layers.38.self_attn.o_proj
- model.layers.15.self_attn.o_proj
- model.layers.17.self_attn.o_proj
- model.layers.9.self_attn.o_proj
- model.layers.10.self_attn.o_proj
- model.layers.21.self_attn.o_proj
- model.layers.28.self_attn.o_proj
- model.layers.32.self_attn.o_proj
- model.layers.35.self_attn.o_proj
- model.layers.39.self_attn.o_proj
- model.layers.3.self_attn.o_proj
# self_attn.q_proj layers
- model.layers.1.self_attn.q_proj
- model.layers.2.self_attn.q_proj
- model.layers.3.self_attn.q_proj
- model.layers.44.self_attn.q_proj
- model.layers.29.self_attn.q_proj
- model.layers.45.self_attn.q_proj
- model.layers.43.self_attn.q_proj
- model.layers.32.self_attn.q_proj
- model.layers.38.self_attn.q_proj
- model.layers.19.self_attn.q_proj
- model.layers.42.self_attn.q_proj
- model.layers.34.self_attn.q_proj
- model.layers.36.self_attn.q_proj
- model.layers.40.self_attn.q_proj
- model.layers.26.self_attn.q_proj
- model.layers.20.self_attn.q_proj
- model.layers.39.self_attn.q_proj
- model.layers.28.self_attn.q_proj
- model.layers.35.self_attn.q_proj
- model.layers.41.self_attn.q_proj
- model.layers.33.self_attn.q_proj
- model.layers.25.self_attn.q_proj
- model.layers.30.self_attn.q_proj
- model.layers.27.self_attn.q_proj
# self_attn.v_proj layers
- model.layers.0.self_attn.v_proj
- model.layers.7.self_attn.v_proj
- model.layers.39.self_attn.v_proj
- model.layers.31.self_attn.v_proj
- model.layers.15.self_attn.v_proj
- model.layers.10.self_attn.v_proj
- model.layers.32.self_attn.v_proj
- model.layers.41.self_attn.v_proj
- model.layers.6.self_attn.v_proj
- model.layers.33.self_attn.v_proj
- model.layers.42.self_attn.v_proj
- model.layers.29.self_attn.v_proj
- model.layers.14.self_attn.v_proj
- model.layers.9.self_attn.v_proj
- model.layers.35.self_attn.v_proj
- model.layers.38.self_attn.v_proj
- model.layers.13.self_attn.v_proj
- model.layers.30.self_attn.v_proj
- model.layers.5.self_attn.v_proj
- model.layers.34.self_attn.v_proj
- model.layers.28.self_attn.v_proj
- model.layers.37.self_attn.v_proj
- model.layers.27.self_attn.v_proj
- model.layers.11.self_attn.v_proj
# model.embed_tokens layers


gradient_accumulation_steps: 16
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_torch_fused
lr_scheduler: linear
learning_rate: 5e-6

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

gradient_checkpointing: unsloth
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 2
saves_per_epoch: 1
save_total_limit: 4
debug:
deepspeed: /home/ubuntu/axolotl/deepspeed_configs/zero3_bf16.json
weight_decay: 0.05
special_tokens:
  eos_token: <|im_end|>

Possible solution

No response

Which Operating Systems are you using?

  • Linux

Python Version

python 3.12.6

axolotl branch-commit

main/718cfb2dd1ff2a03b89e3b95f0b1aa1e04046e6e

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
Malikeh97 added the bug (Something isn't working) label on Oct 23, 2024
@NanoCode012 (Collaborator)

Hey! Unfortunately, I do not have that machine to replicate this. I tested Llama 3 8B (using the example config) in the past few days, and it worked on multi-GPU.

Would you be able to start from one of the example configs to make sure that works first? Alternatively, does provisioning a separate machine help?

@ehartford (Collaborator)

Try without spectrum

@williambarberjr

I work with Malikeh. We've tried without Spectrum. Specifically, we've tried existing YAMLs for LoRAs of Llama 3.1 8B and get the exact same error at the same point in the launch process. Both YAMLs run fine on cloud providers with pre-defined Axolotl setups. We've also tried the Docker commands in the README and get the exact same error with the Docker setup. We've tried inside and outside a venv, using uv venv and uv pip install as well as standard python -m venv and pip install. In all cases, the same error persists at the same point during launch.

We could start from a truly fresh base Ubuntu install; it's just a lot more painful because CUDA, pip, etc. are not yet installed, so setup takes quite a bit longer. For what it's worth, we also modify the requirements.txt and setup.py files each time to remove autoawq, because pip install autoawq also fails every time and we don't need it for our training run.

@chiwanpark (Contributor) commented Oct 24, 2024

Could you try without sample packing? I have a similar problem where LoRA training for Qwen2 fails when sample packing is enabled, and I'm trying to identify the cause.
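
For reference, that experiment amounts to flipping two lines in the config posted above, leaving everything else unchanged:

# disable packing for both training and evaluation to rule it out
sample_packing: false
eval_sample_packing: false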

@williambarberjr

Isn't sample packing a very significant (2x+) improvement in total training speed? It also helps minimize the impact of the gradient accumulation bug. I'd much prefer to keep using sample packing if possible, but we can experiment with turning it off to see if that helps diagnose the issue.

@NanoCode012 (Collaborator) commented Oct 29, 2024

Hey @Malikeh97 @williambarberjr, I can reproduce this issue on 2x A100 SXM. I noticed that it still exists if I turn off Liger, Spectrum, and Unsloth grad checkpointing, and swap to a Llama model. I'll need some time to narrow down the core issue.

Are you able to test the config examples/llama-3/fft-8b-liger-fsdp.yaml (use this modified version: https://github.com/axolotl-ai-cloud/axolotl/blob/a3085c6444da0f716dca0becc2f8864b2998d278/examples/llama-3/fft-8b-liger-fsdp.yaml)? I got this working on my setup.

> Specifically we've tried with existing yml's for Loras of L3.1 8B and get the exact same error at the same point in the launch process as well. Both yml's run fine on cloud providers with pre defined Axolotl setups.

Sorry, do you mean it works or does not?


Edit: I've narrowed it down to trust_remote_code: true. I believe the model loading follows a slightly different path when loading remote code. Will investigate.
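
For anyone needing a workaround in the meantime: Qwen2.5 uses the natively supported Qwen2 architecture (and Llama 3 is likewise native in transformers), so the config above should also load with remote code disabled, e.g.:

# the architecture ships with transformers, so remote code is not required
trust_remote_code: false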

NanoCode012 self-assigned this on Oct 29, 2024
@chiragjn (Contributor)

I am running into the same issue, but with Llama-3.2-1B-Instruct / Qwen2.5-0.5B-Instruct QLoRA + DeepSpeed ZeRO-2 (no offload) when sample packing is enabled; the model forward itself breaks. Disabling gradient checkpointing and Liger and setting gradient accumulation to 1 doesn't help in my case.

Only when I disable sample_packing does it work fine. I suspect it is an issue with the latest version of flash-attn (2.6.3), or that the sample packing implementation is incompatible with it.

I'll report back if I figure out something

@NanoCode012 (Collaborator)

@chiragjn, did you have trust_remote_code enabled?

@chiragjn (Contributor) commented Oct 30, 2024

Yes. You are right, @NanoCode012: setting trust_remote_code to False also gets it working. Very interesting; time to check with the debugger 😅

One bonkers side observation: very high random spikes in GPU memory that cause OOM. A 1B model in QLoRA bf16 with batch size 4 and sequence length 4096 should not OOM on a 48GB GPU. Will investigate.

@chiragjn
Copy link
Contributor

I think this has been a problem since transformers 4.43, when they moved _get_unpad_data:

huggingface/transformers@e314395#diff-1cc408601d83b77ccf2d6daea98099be941686d4aa082fe61e806e0d6b314d06

The condition for patching this in axolotl relies on trust_remote_code being false 😅

elif hasattr(transformers, "modeling_flash_attention_utils") and not is_remote_code:
    transformers.modeling_flash_attention_utils._get_unpad_data = (  # pylint: disable=protected-access
        get_unpad_data
    )

If I understand things correctly, this should check for custom code by looking at the config.json, not just the trust_remote_code flag.
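
A minimal sketch of that idea (hypothetical helper name; the actual patch may differ): custom modeling code is advertised through the auto_map key in config.json, so the patch condition could inspect the resolved config instead of only the user-facing flag.

import json

from transformers.utils import cached_file


def config_declares_custom_code(model_name_or_path: str) -> bool:
    """Return True only if the model's config.json maps auto classes to custom code."""
    # cached_file resolves both local directories and Hub repo ids to a local config.json path
    config_path = cached_file(model_name_or_path, "config.json")
    with open(config_path, encoding="utf-8") as fin:
        config = json.load(fin)
    # Models that actually need trust_remote_code declare their classes under "auto_map"
    return "auto_map" in config


# The monkeypatch condition would then depend on what the model declares rather than
# only on whether the user happened to set trust_remote_code: true, e.g.:
#
# elif hasattr(transformers, "modeling_flash_attention_utils") and not config_declares_custom_code(base_model):
#     transformers.modeling_flash_attention_utils._get_unpad_data = get_unpad_data

Qwen2.5's config.json does not declare an auto_map, so with a check like this the patch would still be applied even when trust_remote_code: true is set.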

@chiragjn (Contributor)

I can confirm it works with the latest transformers with this change: truefoundry@af48625
