
Transformers.Trainer fails to create the optimizer for optim adamw_torch_fused when launched with deepspeed #31867

Open
princethewinner opened this issue Jul 9, 2024 · 7 comments

Comments

@princethewinner

princethewinner commented Jul 9, 2024

System Info

  • transformers version: 4.42.3
  • Platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.31.0
  • deepspeed version: 0.14.4
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: Yes
  • Using GPU in script?: Yes
  • GPU type: Tesla V100-SXM2-32GB

Who can help?

@muellerzr

The issue arises when the script is launched with deepspeed. It seems that the model has not been moved to the GPU by the time create_optimizer is called, so creating the fused optimizer fails.
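
As a minimal check (not from the original report, and assuming the failure comes from the fused optimizer itself): on older torch builds such as 2.2.x, torch.optim.AdamW(..., fused=True) rejects parameters that are still on the CPU, which would match a model that has not yet been moved to the GPU.

import torch

# Minimal sketch: construct a fused AdamW over CPU-resident parameters.
# Assumption: this mirrors what Trainer.create_optimizer attempts here.
cpu_params = [torch.nn.Parameter(torch.zeros(4))]
try:
    torch.optim.AdamW(cpu_params, lr=1e-3, fused=True)
    print(f"torch {torch.__version__}: fused AdamW accepted CPU parameters")
except RuntimeError as err:
    print(f"torch {torch.__version__}: fused AdamW rejected CPU parameters: {err}")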

Launch command

deepspeed --num_gpus=2 trainer_adamw_fused_test.py 

Output:
[screenshot of the error traceback raised while creating the adamw_torch_fused optimizer]

However, passing deepspeed=None instead of the config dict, with the same launch command, does not cause any error and training continues as usual. So I am guessing the failure is caused by conflicting DeepSpeed settings or incorrect parsing of the DeepSpeed config.
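
For reference, the working configuration is simply the same TrainingArguments with the DeepSpeed config dropped (a sketch of the variant I tested, matching the reproduction script below):

from transformers import TrainingArguments

# Same arguments as in the reproduction script, but without the DeepSpeed config;
# with this, the adamw_torch_fused optimizer is created without error.
training_args = TrainingArguments(
    output_dir="test_trainer",
    eval_strategy="epoch",
    optim="adamw_torch_fused",
    deepspeed=None,
)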

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import numpy as np
import evaluate

from loguru import logger


class CustomTrainer(Trainer):
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
    
    def create_optimizer(self):
        # Log each parameter's device type right before the optimizer is created.
        logger.debug("Named parameters [{}]", [b.device.type for a, b in self.model.named_parameters()])
        return super().create_optimizer()

dataset = load_dataset("yelp_review_full")
dataset["train"][100]

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)

deepspeed_dict = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 1
    }
}

training_args = TrainingArguments(output_dir="test_trainer", eval_strategy="epoch", optim="adamw_torch_fused", deepspeed=deepspeed_dict)

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Expected behavior

Training should be completed.

@amyeroberts
Collaborator

cc @muellerzr @SunMarc

@VladOS95-cyber
Contributor

VladOS95-cyber commented Sep 19, 2024

Hi @amyeroberts! I'll take a look at this issue.

@VladOS95-cyber
Contributor


Hello, I ran into some issues installing deepspeed, and I'm not sure how quickly or effectively I'll be able to resolve them. I think it would be better if someone else takes this task.

@Ben-Schneider-code
Contributor

I guess I'll take a crack at this one.

@Ben-Schneider-code
Contributor

Hi @princethewinner, this seems to be caused by a versioning issue between pytorch and deepspeed. It can be resolved by rolling pytorch forward from 2.2.1 -> 2.4 (make sure you uninstall and reinstall deepspeed too, since its build depends on your torch install). Alternatively, an older version of deepspeed might work, but I didn't experiment with that.
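
A quick way to confirm the environment after the upgrade (a minimal sketch, assuming both packages import cleanly):

import torch
import deepspeed

# Print the installed versions to confirm torch >= 2.4 and a matching deepspeed build.
print("torch:", torch.__version__)
print("deepspeed:", deepspeed.__version__)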

@ArthurZucker I think this issue can be closed, I don't think there is anything transformers related here.

@SunMarc
Member

SunMarc commented Oct 18, 2024

Please let us know, @princethewinner, if the issue is fixed by upgrading torch!


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
