ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. #186
Comments
I tried it and got this output:

```
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /home/learner/anaconda3/envs/StockQlora/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
  0%|          | 0/1875 [00:02<?, ?it/s]
```
When I changed the device to "cpu", it returned a different error.
Try updating the accelerate version.
Well, I tried, but it's not the same problem as that one (huggingface/peft#414). My accelerate is already 0.21.0.dev0 and pip install returns "already satisfied". But thanks. I think this is probably a quantization issue; may I ask what kind of GPU you are using? I noticed that you seem to be running this library successfully.
Have you installed …?
Already installed. You ran the code successfully on an A100-80GB and I have the same GPU, so the only difference should be that I have multiple GPUs, and that is what causes this error? What should I do about that? Which model did you use? Could you run finetune_guanaco_7b.sh?
Actually, I used 3 A100 GPUs; check the following requirements.txt (…).
I tried it and it gives the same error. Thanks anyway.
It seems the code trains on the first GPU by default, but that GPU is occupied by another program. So when I set "auto", it uses cuda:0 and goes out of memory. When I set it to a free GPU such as cuda:6 or cuda:7, it loads the model there but still trains on cuda:0. That might be the reason; I can't find any other explanation, since others can run the model on multiple A100-80GB, which means it's not a fault of quantization or the code. I'll have to figure out the train() setup, wish me luck. Thanks for the help @FHL1998
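A quick way to see the mismatch described above (a diagnostic sketch; `model` is assumed to be the quantized model already loaded with from_pretrained, and this is not code from qlora.py):

```python
import torch

# Devices the quantized weights actually ended up on,
# e.g. {device(type='cuda', index=6)} when loaded with device_map="cuda:6".
print({p.device for p in model.parameters()})

# GPU this process defaults to -- the one the Trainer will train on, e.g. 0.
# If the two disagree, accelerate raises the ValueError from this issue.
print(torch.cuda.current_device())
```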
Maybe you shouldn't modify the code, but use …
Oh my goodness, it works. Thank you very much, I really appreciate it. You solved the problem I've been struggling with for days.
Running T5-flan-base on an 8×A6000 48GB server, I can launch with CUDA_VISIBLE_DEVICES set to 0, 0,1, or 0,1,2, but with more than 3 GPUs I get the same "ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example …"
Thanks @YanJiaHuan: `device_map={'': torch.cuda.current_device()}` fixed the error for me on a 4 GPU setup. Full example:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    device_map={'': torch.cuda.current_device()},  # fix for >3 GPUs
)
```
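A follow-up note (a sketch under assumptions, not code from this repo): `torch.cuda.current_device()` simply reports whichever GPU the current process has set, so under a multi-process launcher each rank should pin its own device before loading. The sketch below assumes a launcher that exports `LOCAL_RANK` (as torchrun and accelerate do) and uses `model_id` as a placeholder checkpoint name:

```python
import os

import torch
from transformers import AutoModelForCausalLM

# Pin this process to its own GPU first, so torch.cuda.current_device()
# (and therefore the device_map below) matches the device the trainer
# will later train on.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model_id = "huggyllama/llama-7b"  # placeholder; any causal LM checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,                             # as in the example above
    device_map={"": torch.cuda.current_device()},  # whole model on this rank's GPU
)
```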
Actually, when running a single test it worked, but not when run as part of a larger test suite. I guess …
Yes, …
A small addition: I tried all the suggested solutions, but they did not help me. It took a day to find it: …
This is how I solved my issue: inside my Python script I used these commands: …

My guess is that the system now treats GPU 3 as GPU 0 (the default GPU) because of the `export CUDA_VISIBLE_DEVICES=3,4,5` command. After the export, I checked the available GPUs and the system gave this output: Number of available GPUs: 3
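To illustrate the renumbering described above (a generic sketch, not the commenter's elided commands): once `CUDA_VISIBLE_DEVICES` restricts the visible devices, PyTorch numbers them from zero, so physical GPU 3 shows up as `cuda:0`.

```python
import os

# Must be set before torch initializes CUDA (or exported in the shell
# before launching the script).
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4,5"

import torch

print("Number of available GPUs:", torch.cuda.device_count())  # -> 3
print("Default device index:", torch.cuda.current_device())    # -> 0 (physical GPU 3)
```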
I'm new to this field and wanted to try it out. I ran into this problem when trying to run the shell script (finetune_guanaco_7b.sh) or `python qlora.py --learning_rate 0.0001 --model_name_or_path huggyllama/llama-7b`. Here's the information; I need your help.
First, this is my environment: Python 3.9.16, torch 2.0.1+cu118, torchvision 0.15.2+cu118, and the packages from requirements.txt.
Second, I hit an out-of-memory error when running on multiple A100 80GB GPUs, so I modified qlora.py at line 267 and changed the device from "auto" to a fixed "cuda:1". I figured another program was taking up the memory, so I moved this one to a free GPU. That error went away, but the run still fails with another error, and I barely know how to solve it.
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'': torch.cuda.current_device()}` or `device_map={'': torch.xpu.current_device()}`
I thought the problem might be here and changed it to: … It returns the same error.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /home/learner/anaconda3/envs/StockQlora/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
/home/learner/anaconda3/envs/StockQlora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/learner/anaconda3/envs/StockQlora did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary /home/learner/anaconda3/envs/StockQlora/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda121.so...
loading base model huggyllama/llama-7b...
/home/learner/anaconda3/envs/StockQlora/lib/python3.9/site-packages/transformers/modeling_utils.py:2192: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:14<00:00, 7.01s/it]
adding LoRA modules...
trainable params: 79953920.0 || all params: 3660320768 || trainable: 2.184341894267557
loaded model
Adding special tokens.
Downloading readme: 7.47kB [00:00, 3.14MB/s]
Downloading and preparing dataset parquet/tatsu-lab--alpaca to /home/learner/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-2b32f0433506ef5f/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7...
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24.2M/24.2M [00:00<00:00, 53.7MB/s]
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.23s/it]
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1272.16it/s]
Dataset parquet downloaded and prepared to /home/learner/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-2b32f0433506ef5f/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7. Subsequent calls will reuse this data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 389.26it/s]
torch.float32 422326272 0.11537932153507864
torch.uint8 3238002688 0.8846206784649213
Traceback (most recent call last):
File "/home/learner/qlora/qlora.py", line 807, in
train()
File "/home/learner/qlora/qlora.py", line 769, in train
train_result = trainer.train()
File "/home/learner/anaconda3/envs/StockQlora/lib/python3.9/site-packages/transformers/trainer.py", line 1531, in train
return inner_training_loop(
File "/home/learner/anaconda3/envs/StockQlora/lib/python3.9/site-packages/transformers/trainer.py", line 1642, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/learner/anaconda3/envs/StockQlora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1198, in prepare
result = tuple(
File "/home/learner/anaconda3/envs/StockQlora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1199, in
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/learner/anaconda3/envs/StockQlora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1026, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/learner/anaconda3/envs/StockQlora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1277, in prepare_model
raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'': torch.cuda.current_device()}` or `device_map={'': torch.xpu.current_device()}`
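For reference, the fix the error message points at (and which the commenters above confirmed) is to load the whole quantized model onto the device the training process will use, rather than letting `device_map="auto"` spread it across GPUs. Below is a minimal sketch for the 8-bit case named in the traceback; it is not the actual qlora.py code, and the checkpoint is just the one mentioned in the issue:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",
    load_in_8bit=True,                             # 8-bit precision, as in the error
    device_map={"": torch.cuda.current_device()},  # entire model on the training GPU
)
```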