
Trainer class on Mac uses accelerate to incorrectly set MPS device #24697

Closed

alex2awesome opened this issue Jul 6, 2023 · 10 comments

@alex2awesome commented Jul 6, 2023

System Info

transformers==4.30.2
Mac 2019, Ventura 13.4

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

ISSUE: I am running generic model training using Trainer locally on my Mac. My model is being moved to MPS, but my tensors are staying on CPU.

I can provide more details about my script, but I suspect this is a general library problem. Here are the lines of code I discovered:

When the accelerator is instantiated in the Trainer class, it doesn't get passed any user-specific arguments (e.g. from TrainingArguments) that would give the user control over which device to use. As a result, when running locally on a Mac, Accelerate infers which device we want and moves the model to self.device in the non-distributed setting. I'm not sure yet how self.device is determined in Accelerate; however, Trainer doesn't natively move my data to MPS, so my script crashes.
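To illustrate the mismatch, here is a minimal sketch (not the actual Trainer code) assuming a Mac where the MPS backend is available:

# Sketch only: Accelerator() with no arguments picks MPS on a Mac where it is
# available, so the model ends up on "mps" while a manually built batch stays
# on "cpu" -- the device mismatch described above.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 2).to(accelerator.device)  # moved to mps
batch = torch.randn(3, 4)                             # still on cpu
print(accelerator.device, batch.device)
# model(batch) would raise a device-mismatch error at this point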

Expected behavior

Ideally, there would be a flag I could pass into Trainer to avoid MPS when I don't want it and just stick with the CPU.

@alex2awesome (Author)

EDIT:

Adding the --no_cuda flag in TrainingArguments takes care of this issue.

I suggest renaming it to something like --use_cpu or --no_cuda_or_mps, because I totally didn't realize it could be used for this purpose and had to dive to the very bottom of the codebase to find it.
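A minimal sketch of that workaround in code (assuming transformers ~4.30, where the flag is still named no_cuda):

# Sketch of the workaround: no_cuda=True keeps training on CPU, and per the
# discussion in this thread it also prevents the Trainer from selecting MPS.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=".",
    no_cuda=True,
)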

@ydshieh (Collaborator) commented Jul 7, 2023

I am not really an expert on this topic, but do you think #24660 will help?

@ydshieh (Collaborator) commented Jul 7, 2023

If not, a reproducible script is indeed necessary, please 🙏

@tcapelle

I have a similar issue: the Trainer was automatically using the MPS backend and I couldn't figure out a way to run on CPU. (The MPS backend is missing some operations, so not all models run!)
Using no_cuda=True in the TrainingArguments solved the issue! Pretty unintuitive!

@sgugger (Collaborator) commented Jul 17, 2023

cc @SunMarc. Maybe we could deprecate the no_cuda flag and replace it with use_cpu, which would be more intuitive?
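A rough sketch of how such a deprecation could look (a hypothetical toy dataclass, not the actual transformers implementation):

# Hypothetical sketch of deprecating no_cuda in favor of use_cpu; the real
# TrainingArguments dataclass in transformers is far larger than this.
import warnings
from dataclasses import dataclass, field

@dataclass
class ToyTrainingArguments:
    use_cpu: bool = field(default=False, metadata={"help": "Force training on CPU."})
    no_cuda: bool = field(default=False, metadata={"help": "Deprecated, use `use_cpu` instead."})

    def __post_init__(self):
        if self.no_cuda:
            warnings.warn("`no_cuda` is deprecated, use `use_cpu` instead.", FutureWarning)
            self.use_cpu = True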

@SunMarc (Member) commented Jul 17, 2023

Yes, we should do that, since we automatically set the device to cuda or mps if available. Furthermore, use_mps_device in TrainingArguments is also deprecated. I will open a PR for that. The other issue is that we don't dispatch the data to the right device. @muellerzr, I see that we don't move the dataloader to a specific device in get_train_dataloader. Is this something we want to add? I can open a PR for it if needed.

@muellerzr (Contributor) commented Jul 17, 2023

@SunMarc accelerate does this automatically in its dataloader / with the Accelerator, so this should already be happening. If not, it's something we need to fix in accelerate.
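A standalone sketch of that behavior, using a toy PyTorch DataLoader rather than the Trainer's own:

# Sketch: Accelerator.prepare wraps the DataLoader so that the batches it
# yields are placed on accelerator.device (cuda, mps, or cpu) automatically.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # Accelerator(cpu=True) would force CPU instead
dataset = TensorDataset(torch.randn(8, 4))
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=2))

for (batch,) in dataloader:
    print(batch.device)  # expected to match accelerator.device
    break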

@tcapelle commented Jul 17, 2023

There is also another issue: the default device is mps, but the data is not moved to mps, so the Trainer raises an error. Minimal code:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_checkpoint = "roneneldan/TinyStories-33M"
ds = load_dataset('MohamedRashad/characters_backstories')["train"]

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(example):
    merged = example["text"] + " " + example["target"]
    batch = tokenizer(merged, padding='max_length', truncation=True, max_length=128)
    batch["labels"] = batch["input_ids"].copy()
    return batch

tokenized_dataset = ds.map(tokenize_function, remove_columns=["text", "target"])

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

training_args = TrainingArguments(
    num_train_epochs=1,
    output_dir=".",
    # use_mps_device=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

print(trainer.accelerator.device)
# device("mps")

# Let's train!
trainer.train()

You can solve the issue by explicitly passing use_mps_device=True or no_cuda=True to the TrainingArguments.

PS: I am on the latest versions of transformers, datasets, and accelerate (pip install -U ...).

@SunMarc (Member) commented Jul 17, 2023

Hey @tcapelle, thanks for the snippet. It helps a lot in solving the issue. I was able to reproduce the bug on the latest released version of transformers. This bug is fixed on the main branch of transformers, which you can install with pip install git+https://github.com/huggingface/transformers.git. Let me know if it works on your side.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
