Trainer class on Mac uses accelerate to incorrectly set MPS device
#24697
Comments
EDIT: Adding the flag, I suggest making it something like
I am not really an expert on this topic, but do you think #24660 will help?
If not, a reproducible script is indeed necessary, please 🙏
I have a similar issue: the Trainer was automatically using the MPS backend and I couldn't figure out a way to run on CPU. (The MPS backend is missing some operations, so not all models run!)
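As a stop-gap for the missing-operations part, here is a minimal sketch assuming PyTorch's PYTORCH_ENABLE_MPS_FALLBACK environment variable, which lets unsupported MPS ops fall back to CPU instead of erroring; it has to be set before torch is imported:

import os
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")  # must be set before `import torch`

import torch
print(torch.backends.mps.is_available())  # sanity check that the MPS backend is visible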
cc @SunMarc Maybe we could deprecate the
Yes, we should do that since we will automatically set the device to
@SunMarc accelerate does this automatically in its dataloader / with the Accelerator, so this should already be happening. If not, it's something we need to fix in accelerate.
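For reference, a minimal sketch of the Accelerate behaviour being described, using only the public Accelerator API (the tiny model and dataset here are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # selects the best available device; on a Metal-capable Mac this is typically mps
model = torch.nn.Linear(4, 2)
loader = DataLoader(TensorDataset(torch.randn(8, 4)), batch_size=4)

model, loader = accelerator.prepare(model, loader)
print(accelerator.device)        # e.g. device(type='mps')
for (batch,) in loader:
    print(batch.device)          # the prepared dataloader moves each batch to accelerator.device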
There is also another issue: the default device is mps. Reproduction snippet:

from transformers import AutoTokenizer
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from transformers import Trainer, TrainingArguments
model_checkpoint = "roneneldan/TinyStories-33M"
ds = load_dataset('MohamedRashad/characters_backstories')["train"]
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(example):
merged = example["text"] + " " + example["target"]
batch = tokenizer(merged, padding='max_length', truncation=True, max_length=128)
batch["labels"] = batch["input_ids"].copy()
return batch
tokenized_dataset = ds.map(tokenize_function, remove_columns=["text", "target"])
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
training_args = TrainingArguments(
num_train_epochs=1,
output_dir=".",
# use_mps_device=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset,
)
print(trainer.accelerator.device)
# device("mps")
# Let's train!
trainer.train()

You can solve the issue by explicitly using
PS: I am on the latest version of
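Opting in to MPS explicitly, rather than relying on auto-detection, is one way to work around this; here is a hedged sketch re-using the snippet above and the use_mps_device flag that is commented out in it (assuming that flag exists in your TrainingArguments version):

training_args = TrainingArguments(
    num_train_epochs=1,
    output_dir=".",
    use_mps_device=True,  # explicit opt-in instead of relying on device auto-detection
)
trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_dataset)
print(trainer.accelerator.device)  # expected: device(type='mps')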
Hey @tcapelle, thanks for the snippet. It helps a lot in solving the issue. I was able to reproduce the bug on the latest version of
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers==4.30.2
Mac 2019, Ventura 13.4
Who can help?
No response
Reproduction
ISSUE: I am running a generic model training using Trainer on my Mac, locally. My model is being moved to MPS, but my tensors are staying on CPU.
I can provide more details about my script, but I kinda expect this is a general library problem. Here is what I discovered in the code:
When the accelerator is instantiated in the Trainer class, it doesn't get passed any user-specific arguments (from TrainingArguments, for example) that would give the user control over which device to use. As a result, when running locally on a Mac, Accelerate does a lot of inference about which device we want to use and moves the model to self.device in the non-distributed setting. I'm not sure yet how self.device is instantiated in Accelerate; however, Trainer doesn't natively move my data to mps, so my script is crashing.
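For illustration, a minimal stand-alone sketch (plain PyTorch, made-up tensor shapes) of the kind of crash a model-on-mps / batch-on-cpu mismatch produces:

import torch

if torch.backends.mps.is_available():
    model = torch.nn.Linear(4, 2).to("mps")  # the model ends up on MPS
    batch = torch.randn(8, 4)                # the batch stays on CPU
    try:
        model(batch)
    except RuntimeError as e:
        print(e)                             # device-mismatch error, i.e. the training crash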
Expected behavior
Ideally, I would have a flag I can pass into Trainer to tell it not to use MPS if I don't want it, and just stick with CPU.
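Until such a flag exists, one blunt workaround is to hide MPS from torch before the Trainer (and its Accelerator) is created, so device auto-detection falls back to CPU; this is a sketch of a hack, not a supported transformers or accelerate API:

import torch

# Monkeypatch: make MPS look unavailable so device auto-detection falls back to CPU.
# Must run before the Trainer / Accelerator is instantiated.
torch.backends.mps.is_available = lambda: False
torch.backends.mps.is_built = lambda: False

# trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_dataset)
# print(trainer.accelerator.device)  # expected: device(type='cpu')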