finetuning with PEFT int-8bit + LoRA on single node multiGPU was working, now doesn't any more #1840
Comments
Thanks for your reply! However, my code was working with this exact setup until a couple of weeks ago. Can you help me figure out why it does not work anymore? Also, what versions of the libraries involved would I have gotten if I had been installing from source on, say, July 30? Again, the code used to work. Maybe it was not doing data parallelism correctly, but at least I had resolved the OOM errors. Trying to run the same notebook on a single GPU gets OOM immediately, meaning that the code was indeed leveraging the multiple GPUs somehow.
Could you please give us the full traceback so we can understand what is going on?
Apologies for the delay. Sure. The setup (model download and quantization, LoRA layers, PEFT adapter preparation) is exactly as in my initial message. The notebook cell that instantiates the Trainer errors out with a traceback, with the following library versions installed. So I restart from the beginning and update bitsandbytes as suggested. This time the trainer instantiation (exactly as before) succeeds, with some extra cell output (probably due to my high debug level? not sure), but the next notebook cell, trainer.train(), fails immediately with a traceback.
Expected behavior: running that cell actually finetunes the model, running the required number of steps as intended, without running OOM (that's what used to happen until late July).
cc @younesbelkada and @pacman100 since it's a PEFT model.
I'm facing a very similar issue, also in Databricks, on a single-node multi-GPU cluster.
Traceback:
ValueError Traceback (most recent call last)
File <command-91325968453>, line 6
4 except Exception as e:
5 mlflow.end_run()
----> 6 raise e
File <command-91325968453>, line 3
1 # Train the model
2 try:
----> 3 trainer.train()
4 except Exception as e:
5 mlflow.end_run()
File /databricks/python/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py:432, in safe_patch.<locals>.safe_patch_function(*args, **kwargs)
417 if (
418 active_session_failed
419 or autologging_is_disabled(autologging_integration)
(...)
426 # warning behavior during original function execution, since autologging is being
427 # skipped
428 with set_non_mlflow_warnings_behavior_for_current_thread(
429 disable_warnings=False,
430 reroute_warnings=False,
431 ):
--> 432 return original(*args, **kwargs)
434 # Whether or not the original / underlying function has been called during the
435 # execution of patched code
436 original_has_been_called = False
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/transformers/trainer.py:1539, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1534 self.model_wrapped = self.model
1536 inner_training_loop = find_executable_batch_size(
1537 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
1538 )
-> 1539 return inner_training_loop(
1540 args=args,
1541 resume_from_checkpoint=resume_from_checkpoint,
1542 trial=trial,
1543 ignore_keys_for_eval=ignore_keys_for_eval,
1544 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/transformers/trainer.py:1656, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1654 model = self.accelerator.prepare(self.model)
1655 else:
-> 1656 model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
1657 else:
1658 # to handle cases wherein we pass "DummyScheduler" such as when it is specified in DeepSpeed config.
1659 model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
1660 self.model, self.optimizer, self.lr_scheduler
1661 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/accelerate/accelerator.py:1202, in Accelerator.prepare(self, device_placement, *args)
1200 result = self._prepare_megatron_lm(*args)
1201 else:
-> 1202 result = tuple(
1203 self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
1204 )
1205 result = tuple(self._prepare_one(obj, device_placement=d) for obj, d in zip(result, device_placement))
1207 if tpu_should_fix_optimizer or self.mixed_precision == "fp8":
1208 # 2. grabbing new model parameters
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/accelerate/accelerator.py:1203, in <genexpr>(.0)
1200 result = self._prepare_megatron_lm(*args)
1201 else:
1202 result = tuple(
-> 1203 self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
1204 )
1205 result = tuple(self._prepare_one(obj, device_placement=d) for obj, d in zip(result, device_placement))
1207 if tpu_should_fix_optimizer or self.mixed_precision == "fp8":
1208 # 2. grabbing new model parameters
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/accelerate/accelerator.py:1030, in Accelerator._prepare_one(self, obj, first_pass, device_placement)
1028 return self.prepare_data_loader(obj, device_placement=device_placement)
1029 elif isinstance(obj, torch.nn.Module):
-> 1030 return self.prepare_model(obj, device_placement=device_placement)
1031 elif isinstance(obj, torch.optim.Optimizer):
1032 optimizer = self.prepare_optimizer(obj, device_placement=device_placement)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/accelerate/accelerator.py:1270, in Accelerator.prepare_model(self, model, device_placement, evaluation_mode)
1268 model_devices = set(model.hf_device_map.values())
1269 if len(model_devices) > 1 and self.distributed_type != DistributedType.NO:
-> 1270 raise ValueError(
1271 "You can't train a model that has been loaded in 8-bit precision on multiple devices in any distributed mode."
1272 " In order to use 8-bit models that have been loaded across multiple GPUs the solution is to use Naive Pipeline Parallelism."
1273 " Therefore you should not specify that you are under any distributed regime in your accelerate config."
1274 )
1275 current_device = list(model_devices)[0]
1276 current_device_index = current_device.index if isinstance(current_device, torch.device) else current_device
ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices in any distributed mode. In order to use 8-bit models that have been loaded across multiple GPUs the solution is to use Naive Pipeline Parallelism. Therefore you should not specify that you are under any distributed regime in your accelerate config.

Script Snippet:
# Load model directly
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer
# Set up BNB config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Get tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Get LLM model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    use_cache=False,
)
# Set up LoRA and PEFT
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
def format_instruction(sample):
    ...
# Get datasets
train_dataset, test_dataset = ...
# Set up training
args = TrainingArguments(
    output_dir=MODEL_ID.replace("/", "-"),
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    disable_tqdm=False,  # tqdm progress values can be incorrect when packing is enabled
    ddp_find_unused_parameters=False,
)
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    max_seq_length=MAX_SEQ_LENGTH,
    packing=True,
    tokenizer=tokenizer,
    formatting_func=format_instruction,
    args=args,
)
trainer.train()

Dependencies:
Debugging:
I tried removing pieces of my setup one by one, but the error persists.
Hope this helps!
Hi there, thanks all for the ping, let me try to answer this question as well as I can. Firstly, I am very surprised to hear that DDP + multi-GPU + a quantized model loaded with device_map="auto" ever worked for training; accelerate explicitly rejects that combination (that is the ValueError above). The root cause of this issue is that you are loading the quantized model with device_map="auto", which dispatches it across several GPUs, and such a multi-device quantized model cannot be wrapped for any distributed training regime. Hence, two scenarios are left for us now, depending on the initial training setup:

Run the training setup with Naive PP (Naive Pipeline Parallelism)
If the model does not fit entirely into a single GPU, you can continue using device_map="auto", but run the training as a single process and do not specify any distributed regime in your accelerate config; the layers stay sharded across the GPUs and each batch passes through them sequentially.

Use DDP (if the model fits a single GPU)
DDP + quantized models should work if and only if the training setup (meaning model weights, gradients + intermediate hidden states) can entirely fit a single GPU, which I assume is the case given what you described above.
You need a hack so that each worker process loads the entire model on its own GPU. Simply replacing device_map="auto" with a per-process device map, as below, should do it.

Solution
from accelerate import Accelerator
device_index = Accelerator().process_index
device_map = {"": device_index}
...
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    ...
)
then run your script with your usual multi-GPU launcher (see the sketch right after this comment). This is what we do in the TRL library (precisely here) and it seems to work fine so far, therefore I am confident it should fix the issue for both of you. Hope that helps
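For concreteness, here is a minimal, self-contained sketch of the DDP variant described above. The model name, dataset, hyperparameters and launch command are illustrative assumptions, not taken from this thread; only the per-process device_map line reflects the fix suggested in the comment.

# ddp_train_sketch.py - illustrative sketch, not code from this issue
import torch
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

MODEL_ID = "facebook/opt-350m"  # placeholder model that fits on a single GPU

# Each DDP rank loads the FULL model onto its own GPU,
# instead of letting device_map="auto" shard it across all GPUs.
device_map = {"": Accelerator().process_index}

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map=device_map,  # NOT "auto"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=load_dataset("imdb", split="train[:1%]"),  # placeholder dataset
    peft_config=LoraConfig(r=16, lora_alpha=16, task_type="CAUSAL_LM"),
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=2,
        ddp_find_unused_parameters=False,
    ),
)
trainer.train()

The script would be launched once per GPU, for example with accelerate launch --num_processes 4 ddp_train_sketch.py or torchrun --nproc_per_node 4 ddp_train_sketch.py (the exact command depends on your environment). Since every rank owns a full copy of the model, accelerate's multi-device quantized-model check no longer fires.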
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I am struggling with the same issue on Databricks as well. Have you solved your problem?
try this, it worked for me:
This is a super helpful response; I'm wondering how this changes with the QLoRA + FSDP implementation? Launching with accelerate and FSDP, it still appears as though the entire initialization process happens on one GPU, and if the quantized model doesn't fit, it errors out with OOM.
I have been experimenting with finetuning the mpt-7b-instruct model on a private dataset. I am developing in Databricks notebooks.
This was my setup:
cluster: single driver node with g5.12xlarge
bits&bytes config:
model config:
lora config (here weight_query_key_modules = [key for key, _ in model.named_modules() if 'Wqkv' in key] is a list of attention layers):
peft+lora model preparation:
skipping the dataset preparation, but ultimately I get the usual input_ids and attention_mask tensors
training:
(a rough sketch of a comparable quantization + LoRA setup is shown right after this list)
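The original config cells were lost in this copy of the issue. Purely as a hedged sketch of what such a setup could look like (the 8-bit flag, the exact LoRA hyperparameters and the trust_remote_code handling are assumptions, not values from the report):

# Hypothetical reconstruction for illustration only, not the reporter's original cells
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "mosaicml/mpt-7b-instruct"  # assumed checkpoint name

bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # "int-8bit" per the issue title

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",       # spreads the 8-bit model across the 4 GPUs of the g5.12xlarge
    trust_remote_code=True,  # MPT models ship custom modeling code
)

# Fused attention projections ("Wqkv") used as LoRA targets
weight_query_key_modules = [key for key, _ in model.named_modules() if "Wqkv" in key]

peft_config = LoraConfig(
    r=16,                    # assumed rank
    lora_alpha=32,           # assumed scaling
    lora_dropout=0.05,       # assumed dropout
    target_modules=weight_query_key_modules,
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)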
Up until ~ 2 weeks ago, this used to work. Now it doesn't work anymore.
First, the Trainer initialization crashes and hints that I should upgrade to bitsandbytes==0.41.1 (looks like this error). So I update, but now trainer.train() crashes with an error which I do not understand, because I am not running any script and I did not use device_map = 'auto' (ok, the result is the same, my model is distributed, but it is still not the clearest error), and generally that did not use to be a problem.
Then I tried to manually avoid the error by setting peft_model._is_quantized_training_enabled = True before Trainer initialization. The Trainer then initializes correctly, although I get an additional warning which I do not follow: I did not touch the save_safetensors option, so why the additional warning? That works, but then trainer.train() crashes with the same error.
I tried other combinations of versions (downgrading accelerate to 0.21.0, downgrading transformers to 4.30.0 or 4.31.0...), but nothing seems to be working (honestly, I do not know if I tried all the possible combinations). I mostly get the same ValueError as before, although sometimes I get the following error:
(from here), which again does not make much sense, since I am not passing any accelerate config and since the same code was working up to a couple of weeks ago.
What changed?
Unfortunately I do not know exactly what package versions I had installed two weeks ago. My installation setup was:
It is obviously pretty redundant to pin the version of transformers and then reinstall it, but I didn't need to clean that up until now.
Now, the same installation doesn't work anymore.
I looked at similar issues (this and this, for example), but was unable to find a solution to my problem. It seems both issues have been fixed by a PR, but even when installing these libraries from source, my code still doesn't work.
Can someone help me here? There must be some configuration of the many libraries involved (torch, transformers, accelerate, peft, bitsandbytes...) that works for what I am trying to do.
@younesbelkada tagging you since you seem to have helped a lot of people here :) and I'm hoping you can add me to the list.