RuntimeError: CUDA error: an illegal memory access was encountered [When Running SFT on Qwen2.5] #1991
Comments
Hey! Unfortunately, I do not have that machine to replicate this. I tested llama3 8B (using the example config) in the past few days, and it worked on multi-GPU. Would you be able to start from an example config to ensure it works first? Alternatively, does provisioning a separate machine help?
Try without spectrum.
I work with Malikeh - we've tried without spectrum. Specifically, we've tried existing YAMLs for LoRAs of L3.1 8B and get the exact same error at the same point in the launch process. Both YAMLs run fine on cloud providers with predefined Axolotl setups. We've also tried the docker commands in the readme and get the exact same error with the docker setup. We've tried inside and outside a venv, both with uv venv and uv pip install and with a standard venv created via python -m venv and pip install. In all cases, the same error persists at the same point during launch. We could start from a truly fresh base Ubuntu install; it's just a lot more painful because CUDA, pip, etc. are not yet installed, so setup takes quite a bit longer. For what it's worth, we're also modifying the
Could you try without sample packing? I have a similar problem where LoRA training for Qwen2 fails when sample packing is enabled, and I'm trying to identify the cause of the problem.
Isn't sample packing a very significant (like 2x+) improvement in total training speed? It also helps minimize the impact of the gradient accumulation bug. I'd much prefer to keep using sample packing if possible, but we can experiment with turning it off to see if that helps us diagnose the issue.
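For that experiment, here is a minimal sketch of deriving a no-packing debugging config from the existing YAML. It assumes the standard axolotl `sample_packing` flag, the config file named in the reproduction steps below, and PyYAML being installed; treat it as illustrative rather than a prescribed fix.

```python
# Sketch: write a copy of an existing axolotl config with sample packing
# disabled, to test whether the CUDA illegal-memory-access error goes away.
import yaml

src, dst = "qwen2.5_14B.yml", "qwen2.5_14B_nopack.yml"  # placeholder paths

with open(src) as f:
    cfg = yaml.safe_load(f)

cfg["sample_packing"] = False  # turn off packing for the debugging run

with open(dst, "w") as f:
    yaml.safe_dump(cfg, f)

print(f"Wrote {dst}; launch it with `accelerate launch -m axolotl.cli.train {dst}`")
```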
Hey @Malikeh97 @williambarberjr, I can reproduce this issue on 2x A100 SXM. I noticed it still exists if I turn off liger, spectrum, and Unsloth grad checkpointing, and swap to a Llama model. I'll need some time to narrow down the core issue. Are you able to test this config?
Sorry, do you mean it works or does not? Edit: I've narrowed it down to
I am running into the same issue, but with llama-3.2-1B-instruct / Qwen2.5-0.5B-instruct QLoRA + DeepSpeed ZeRO 2 (no offload). When sample packing is enabled, the model forward itself breaks; only when I disable it does the forward work. I'll report back if I figure out something.
@chiragjn, did you have
Yes |
I think this has been a problem since transformers 4.43, when they moved `_get_unpad_data`. The condition for patching this in axolotl relies on `trust_remote_code` being false 😅 (see axolotl/src/axolotl/monkeypatch/multipack.py, lines 35 to 38 at 8c3a727).
If I understand things correctly, this should check for custom code by looking at the config.json, not just `trust_remote_code`.
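A rough sketch of the kind of check being suggested, assuming the raw config.json is inspected for an `auto_map` entry (which is how Hugging Face configs advertise custom modeling code). The function name and the gating comment are illustrative, not the actual axolotl patch logic.

```python
# Illustration only: detect custom (remote) modeling code from config.json
# instead of relying on the trust_remote_code flag alone. Since transformers
# 4.43, _get_unpad_data lives in transformers.modeling_flash_attention_utils,
# which is the module a multipack patch needs to target for stock models.
from transformers import PretrainedConfig


def uses_custom_modeling_code(model_name_or_path: str) -> bool:
    # get_config_dict returns the raw config.json contents without
    # instantiating any (possibly remote) config class.
    config_dict, _ = PretrainedConfig.get_config_dict(model_name_or_path)
    return "auto_map" in config_dict


# Hypothetical gating for the monkeypatch: only patch stock transformers models.
# if not uses_custom_modeling_code(cfg.base_model):
#     patch_for_multipack(...)  # axolotl's existing patch entry point, signature omitted
```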
I am able to confirm it works with the latest transformers with this change: truefoundry@af48625
Please check that this issue hasn't been reported before.
Expected Behavior
Hi,
We ran the given YAML config to train Qwen2.5-14b-Instruct via supervised fine-tuning (SFT), following the guidelines in the Axolotl repo. For this, we ran the code on a "Deep Learning AMI", specifically
Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.1 (Ubuntu 22.04) 20241016.
Current behaviour
However, we get
RuntimeError: CUDA error: an illegal memory access was encountered
while training the model. I should add that we get the same error with a LoRA config and with the Llama 3 8B model.
Steps to reproduce
nohup accelerate launch -m axolotl.cli.train /home/ubuntu/qwen2.5_14B.yml > training_output.log 2>&1 &
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
python 3.12.6
axolotl branch-commit
main/718cfb2dd1ff2a03b89e3b95f0b1aa1e04046e6e
Acknowledgements