Getting error while saving model #15301

avmodi · 2022-01-23T17:22:50Z

Environment info

transformers version: 4.5.1
Platform: linux
Python version : 3.6
PyTorch version (GPU?): gpu
Tensorflow version (GPU?):
Using GPU in script?: yes
Using distributed or parallel set-up in script?: yes
@sgugger Need help with trainer module

Models:

BERT
I am using a BERT model. The problem arises in the Trainer module.

#15300 File "~/lib/python3.6/site-packages/transformers/trainer.py", line 1608, in save_model
ShardedDDPOption.ZERO_DP_2 in self.args.sharded_ddp or ShardedDDPOption.ZERO_DP_3 in self.args.sharded_ddp
TypeError: 'in <string>' requires string as left operand, not ShardedDDPOption

I am training a BERT model for a multi-class classification task.

To reproduce

Steps to reproduce the behavior:

1. Code
```
import logging
import os
from statistics import mean, stdev
import sys
from typing import Callable, Dict
import pandas as pd

import numpy as np
from pprint import pformat
from scipy.special import softmax
import tensorboard
import torch

from transformers import (
    AutoTokenizer,
    AutoConfig,
    HfArgumentParser,
    Trainer,
    EvalPrediction,
    set_seed
)

from multimodal_args import ModelArguments, MultiModalDataArguments, MultiModalTrainingArguments
from evaluation import calc_classification_metrics, calc_regression_metrics
from load_dataset import load_datadir
from config import TabularConfig
from auto_fusion_model import AutoModelFusion
from utils import create_dir_if_not_exists

os.environ['COMET_MODE'] = 'DISABLED'
logger = logging.getLogger(__name__)


def main():

    # Define text and tabular features
    text_cols = ['keywords', "browse_node_name", "pod", "ORDERING_GL_PRODUCT_GROUP", "gl_product_group_desc"]
    label_col = 'label'
    cat_features = []
    non_num_col = text_cols + ["shipping_address_id", "postal_code", "browse_node_id", "label", "asin", "customer_id", "order_day"]
    #features = pd.read_csv("/efs/avimodi/static_model/feature_importance_static_rhm.csv")
    #features_list = features.head(50)["Feature"].to_list()
    logger.info("Reading sample File")
    sample = pd.read_csv("/efs/avimodi/.MultiModal_Model/input_sample/val.csv")
    features_list = sample.columns.to_list()
    num_features = [col for col in features_list if col not in non_num_col]
    logger.info(len(num_features))
    label_list = ["0", "1", "2"]  # what each label class represents
    column_info_dict = {
        'text_cols': text_cols,
        'num_cols': num_features,
        'cat_cols': cat_features,
        'label_col': 'label',
        'label_list': ["0", "1", "2"]
    }

    model_args = ModelArguments(
        model_name_or_path='bert-base-uncased'
    )

    data_args = MultiModalDataArguments(
        data_path='/efs/avimodi/.MultiModal_Model/input_sample',
        fusion_method='attention',
        features_info=column_info_dict,
        task='classification',
        numerical_encoding='min-max',
        categorical_encoding='none'
    )

    training_args = MultiModalTrainingArguments(
        output_dir="/efs/avimodi/unified_model/run_sample/output",
        logging_dir="/efs/avimodi/unified_model/run_sample/logs",
        overwrite_output_dir=True,
        do_train=True,
        do_eval=True,
        per_device_train_batch_size=256,
        per_device_eval_batch_size=256,
        num_train_epochs=10,
        evaluate_during_training=True,
        logging_steps=25,
        eval_steps=500,
        save_steps=500,
        debug_dataset=True,
        report_to=["tensorboard"],
    )

    set_seed(training_args.seed)

    # Setup logging
    create_dir_if_not_exists(training_args.output_dir)
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
        datefmt="%m/%d/%Y %H:%M:%S",
        filename=os.path.join(training_args.output_dir, 'train_log.txt'),
        filemode='w+'
    )

    logger.info(f"======== Model Args ========\n{(model_args)}\n")
    logger.info(f"======== Data Args ========\n{(data_args)}\n")
    logger.info(f"======== Training Args ========\n{(training_args)}\n")

    tokenizer = AutoTokenizer.from_pretrained(
        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
    )

    train_dataset, val_dataset, test_dataset = load_datadir(
        data_args.data_path,
        data_args.features_info['text_cols'],
        tokenizer,
        label_col=data_args.features_info['label_col'],
        label_list=data_args.features_info['label_list'],
        categorical_cols=data_args.features_info['cat_cols'],
        numerical_cols=data_args.features_info['num_cols'],
        categorical_encoding=data_args.categorical_encoding,
        numerical_encoding=data_args.numerical_encoding,
        sep_text_token_str=tokenizer.sep_token,
        max_token_length=training_args.max_token_length,
        debug=training_args.debug_dataset
    )
    train_datasets = [train_dataset]
    val_datasets = [val_dataset]
    test_datasets = [test_dataset]

    train_dataset = train_datasets[0]

    num_labels = len(np.unique(train_dataset.labels)) if data_args.num_classes == -1 else data_args.num_classes

    def compute_metrics_fn(p: EvalPrediction):
        if data_args.task == "classification":
            preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
            preds_labels = np.argmax(preds, axis=1)
            if p.predictions.shape[-1] == 2:
                pred_scores = softmax(preds, axis=1)[:, 1]
            else:
                pred_scores = softmax(preds, axis=1)
            return calc_classification_metrics(pred_scores, preds_labels,
                                               p.label_ids)
        elif data_args.task == "regression":
            preds = np.squeeze(p.predictions)
            return calc_regression_metrics(preds, p.label_ids)
        else:
            return {}

    total_results = []
    for i, (train_dataset, val_dataset, test_dataset) in enumerate(zip(train_datasets, val_datasets, test_datasets)):
        logger.info(f'======== Fold {i+1} ========')
        config = AutoConfig.from_pretrained(
            model_args.config_name if model_args.config_name else model_args.model_name_or_path,
            cache_dir=model_args.cache_dir,
        )
        tabular_config = TabularConfig(
            num_labels=num_labels,
            cat_feat_dim=train_dataset.cat_feats.shape[1] if train_dataset.cat_feats is not None else 0,
            numerical_feat_dim=train_dataset.numerical_feats.shape[1] if train_dataset.numerical_feats is not None else 0,
            **vars(data_args)
        )
        config.tabular_config = tabular_config

        model = AutoModelFusion.from_pretrained(
            model_args.config_name if model_args.config_name else model_args.model_name_or_path,
            config=config,
            cache_dir=model_args.cache_dir
        )
        if i == 0:
            logger.info(tabular_config)
            logger.info(model)

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            compute_metrics=compute_metrics_fn
        )
        if training_args.do_train:
            train_result = trainer.train(
                resume_from_checkpoint=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
            )
            metrics = train_result.metrics
            # max_train_samples = (
            #     data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
            # )
            metrics["train_samples"] = 500 if training_args.debug_dataset else len(train_dataset)
            trainer.save_model()  # Saves the tokenizer too for easy upload
            trainer.log_metrics("train", metrics)
            trainer.save_metrics("train", metrics)
            trainer.save_state()

        # Evaluation
        eval_results = {}
        if training_args.do_eval:
            logger.info("*** Evaluate ***")
            eval_result = trainer.evaluate(eval_dataset=val_dataset)
            logger.info(pformat(eval_result, indent=4))

            output_eval_file = os.path.join(
                training_args.output_dir, f"eval_metric_results_{task}_fold_{i+1}.txt"
            )
            if trainer.is_world_master():
                with open(output_eval_file, "w") as writer:
                    logger.info("***** Eval results {} *****".format(task))
                    for key, value in eval_result.items():
                        logger.info("  %s = %s", key, value)
                        writer.write("%s = %s\n" % (key, value))

            eval_results.update(eval_result)

        if training_args.do_predict:
            logging.info("*** Test ***")

            predictions = trainer.predict(test_dataset=test_dataset).predictions
            output_test_file = os.path.join(
                training_args.output_dir, f"test_results_{task}_fold_{i+1}.txt"
            )
            eval_result = trainer.evaluate(eval_dataset=test_dataset)
            logger.info(pformat(eval_result, indent=4))
            if trainer.is_world_master():
                with open(output_test_file, "w") as writer:
                    logger.info("***** Test results {} *****".format(task))
                    writer.write("index\tprediction\n")
                    if task == "classification":
                        predictions = np.argmax(predictions, axis=1)
                    for index, item in enumerate(predictions):
                        if task == "regression":
                            writer.write("%d\t%3.3f\t%d\n" % (index, item, test_dataset.labels[index]))
                        else:
                            item = test_dataset.get_labels()[item]
                            writer.write("%d\t%s\n" % (index, item))
                output_test_file = os.path.join(
                    training_args.output_dir, f"test_metric_results_{task}_fold_{i+1}.txt"
                )
                with open(output_test_file, "w") as writer:
                    logger.info("***** Test results {} *****".format(task))
                    for key, value in eval_result.items():
                        logger.info("  %s = %s", key, value)
                        writer.write("%s = %s\n" % (key, value))
                eval_results.update(eval_result)
        del model
        del config
        del tabular_config
        del trainer
        torch.cuda.empty_cache()
        total_results.append(eval_results)
    aggr_res = aggregate_results(total_results)
    logger.info('========= Aggr Results ========')
    logger.info(pformat(aggr_res, indent=4))

    output_aggre_test_file = os.path.join(
        training_args.output_dir, f"all_test_metric_results_{task}.txt"
    )
    with open(output_aggre_test_file, "w") as writer:
        logger.info("***** Aggr results {} *****".format(task))
        for key, value in aggr_res.items():
            logger.info("  %s = %s", key, value)
            writer.write("%s = %s\n" % (key, value))


def aggregate_results(total_test_results):
    metric_keys = list(total_test_results[0].keys())
    aggr_results = dict()

    for metric_name in metric_keys:
        if type(total_test_results[0][metric_name]) is str:
            continue
        res_list = []
        for results in total_test_results:
            res_list.append(results[metric_name])
        if len(res_list) == 1:
            metric_avg = res_list[0]
            metric_stdev = 0
        else:
            metric_avg = mean(res_list)
            metric_stdev = stdev(res_list)

        aggr_results[metric_name + '_mean'] = metric_avg
        aggr_results[metric_name + '_stdev'] = metric_stdev
    return aggr_results


if __name__ == '__main__':
    main()
```

2. Error
Traceback (most recent call last):
 File "run.py", line 289, in <module>
   main()
 File "run.py", line 191, in main
   trainer.save_model()  # Saves the tokenizer too for easy upload
 File "/home/avimodi/anaconda3/envs/chakanik_transformer/lib/python3.6/site-packages/transformers/trainer.py", line 1608, in save_model
   ShardedDDPOption.ZERO_DP_2 in self.args.sharded_ddp or ShardedDDPOption.ZERO_DP_3 in self.args.sharded_ddp
TypeError: 'in <string>' requires string as left operand, not ShardedDDPOption
Killing subprocess 122966
Traceback (most recent call last):
 File "/python3.6/runpy.py", line 193, in _run_module_as_main
   "__main__", mod_spec)
 File "/lib/python3.6/runpy.py", line 85, in _run_code
   exec(code, run_globals)
 File "/home/avimodi/anaconda3/envs/chakanik_transformer/lib/python3.6/site-packages/torch/distributed/launch.py", line 340, in <module>
   main()
 File "/home/avimodi/anaconda3/envs/chakanik_transformer/lib/python3.6/site-packages/torch/distributed/launch.py", line 326, in main
   sigkill_handler(signal.SIGTERM, None)  # not coming back
 File "/lib/python3.6/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
   raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)


sgugger · 2022-01-24T12:39:43Z

The error comes from the value you have in args.sharded_ddp where args is your MultiModalTrainingArguments object. Since you did not share the code of that class, there is little we can do to help fix the issue.

Also, please use the forums to debug your code as we keep the issues for bugs and feature requests only :-)
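To illustrate the failure mode described above, here is a minimal sketch that reproduces the same `TypeError` without `transformers`. It assumes (this is not confirmed in the thread, since the `MultiModalTrainingArguments` code was not shared) that `args.sharded_ddp` is still a plain string when `save_model` runs, for example because the value was never converted into a list of enum members. `ShardedOption` and `parse_sharded` are hypothetical stand-ins written for this example, not the library's API.

```python
from enum import Enum


class ShardedOption(Enum):
    """Hypothetical stand-in for an enum such as ShardedDDPOption."""
    SIMPLE = "simple"
    ZERO_DP_2 = "zero_dp_2"
    ZERO_DP_3 = "zero_dp_3"


def parse_sharded(value):
    """Turn a space-separated string into a list of enum members.

    Assumed conversion step: a list is returned unchanged, a string
    is split on whitespace and each token is mapped to an enum member.
    """
    if isinstance(value, str):
        return [ShardedOption(token) for token in value.split()]
    return value


raw_value = ""  # the option left as an unparsed string

# Membership test against a str: Python requires the left operand to be a str.
try:
    _ = ShardedOption.ZERO_DP_2 in raw_value
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# After parsing, the same check runs against a list and works as intended.
parsed = parse_sharded(raw_value)
is_zero_dp = ShardedOption.ZERO_DP_2 in parsed or ShardedOption.ZERO_DP_3 in parsed

print(error_message)
print(parsed, is_zero_dp)
```

If this assumption holds, checking `type(training_args.sharded_ddp)` right after constructing the arguments object would show a `str` rather than a `list`.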

github-actions · 2022-02-23T15:07:22Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Mar 3, 2022
