
Transformers.Trainer fails to create the optimizer for optim adamw_torch_fused when launched with deepspeed #31867

Open
princethewinner opened this issue Jul 9, 2024 · 7 comments

Comments

@princethewinner

princethewinner commented Jul 9, 2024

System Info

  • transformers version: 4.42.3
  • Platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.31.0
  • deepspeed version: 0.14.4
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: Yes
  • Using GPU in script?: Yes
  • GPU type: Tesla V100-SXM2-32GB

Who can help?

@muellerzr

The issue arises when the script is launched with deepspeed. It seems that the model has not been moved to the GPU by the time create_optimizer is called, so creating the fused optimizer fails.
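
As a minimal check (not from the original report, and assuming the failure comes from the fused optimizer itself): on older torch builds such as 2.2.x, torch.optim.AdamW(..., fused=True) rejects parameters that are still on the CPU, which would match a model that has not yet been moved to the GPU.

import torch

# Minimal sketch: construct a fused AdamW over CPU-resident parameters.
# Assumption: this mirrors what Trainer.create_optimizer attempts here.
cpu_params = [torch.nn.Parameter(torch.zeros(4))]
try:
    torch.optim.AdamW(cpu_params, lr=1e-3, fused=True)
    print(f"torch {torch.__version__}: fused AdamW accepted CPU parameters")
except RuntimeError as err:
    print(f"torch {torch.__version__}: fused AdamW rejected CPU parameters: {err}")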

Launch command

deepspeed --num_gpus=2 trainer_adamw_fused_test.py 

Output:
[screenshot of the error traceback raised while creating the adamw_torch_fused optimizer]

However, passing deepspeed=None instead of the config dict, with the same launch command, does not cause any error and training continues as usual. So I am guessing the failure is caused by conflicting DeepSpeed settings or incorrect parsing of the DeepSpeed config.
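
For reference, the working configuration is simply the same TrainingArguments with the DeepSpeed config dropped (a sketch of the variant I tested, matching the reproduction script below):

from transformers import TrainingArguments

# Same arguments as in the reproduction script, but without the DeepSpeed config;
# with this, the adamw_torch_fused optimizer is created without error.
training_args = TrainingArguments(
    output_dir="test_trainer",
    eval_strategy="epoch",
    optim="adamw_torch_fused",
    deepspeed=None,
)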

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import numpy as np
import evaluate

from loguru import logger


class CustomTrainer(Trainer):
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
    
    def create_optimizer(self):
        # Log each parameter's device type right before the optimizer is created.
        logger.debug("Named parameters [{}]", [b.device.type for a, b in self.model.named_parameters()])
        return super().create_optimizer()

dataset = load_dataset("yelp_review_full")
dataset["train"][100]

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)

deepspeed_dict = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 1
    }
}

training_args = TrainingArguments(output_dir="test_trainer", eval_strategy="epoch", optim="adamw_torch_fused", deepspeed=deepspeed_dict)

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Expected behavior

Training should be completed.

@amyeroberts
Collaborator

cc @muellerzr @SunMarc

@VladOS95-cyber
Contributor

VladOS95-cyber commented Sep 19, 2024

Hi @amyeroberts! I'll take a look at this issue.

@VladOS95-cyber
Contributor


Hello, I ran into some issues installing deepspeed, and I'm not sure how quickly or effectively I'll be able to resolve them. I think it would be better if someone else takes this task.

@Ben-Schneider-code
Contributor

I guess I'll take a crack at this one.

@Ben-Schneider-code
Contributor

Hi @princethewinner, this seems to be caused by a versioning issue between pytorch and deepspeed. It can be resolved by rolling pytorch forward from 2.2.1 -> 2.4 (make sure you uninstall and reinstall deepspeed too, since its build depends on your torch install). Alternatively, an older version of deepspeed might work, but I didn't experiment with that.
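
A quick way to confirm the environment after the upgrade (a minimal sketch, assuming both packages import cleanly):

import torch
import deepspeed

# Print the installed versions to confirm torch >= 2.4 and a matching deepspeed build.
print("torch:", torch.__version__)
print("deepspeed:", deepspeed.__version__)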

@ArthurZucker I think this issue can be closed, I don't think there is anything transformers related here.

@SunMarc
Member

SunMarc commented Oct 18, 2024

Please let us know, @princethewinner, if the issue is fixed by upgrading torch!


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
