
Getting error while saving model #15301

Closed

avmodi opened this issue Jan 23, 2022 · 2 comments


avmodi commented Jan 23, 2022

## Environment info

  • `transformers` version: 4.5.1
  • Platform: Linux
  • Python version: 3.6
  • PyTorch version (GPU?): GPU
  • Tensorflow version (GPU?):
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

@sgugger: I need help with the `Trainer` module.

Models:

  • BERT

I am using a BERT model; the problem arises in the `Trainer` module:

(See also #15300.)

```
File "~/lib/python3.6/site-packages/transformers/trainer.py", line 1608, in save_model
    ShardedDDPOption.ZERO_DP_2 in self.args.sharded_ddp or ShardedDDPOption.ZERO_DP_3 in self.args.sharded_ddp
TypeError: 'in <string>' requires string as left operand, not ShardedDDPOption
```

I am training a BERT model for a multi-class classification task.
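As a side note for readers, the mechanics of this TypeError are easy to reproduce in isolation: testing enum membership against a plain string dispatches to `str.__contains__`, which only accepts strings. The enum below is a minimal stand-in for transformers' `ShardedDDPOption`, not the real class:

```python
from enum import Enum

class ShardedDDPOption(Enum):
    # stand-in for transformers' ShardedDDPOption; values are illustrative
    ZERO_DP_2 = "zero_dp_2"
    ZERO_DP_3 = "zero_dp_3"

sharded_ddp = ""  # sharded_ddp left as a raw string instead of a list of options

# `x in some_string` calls str.__contains__, which requires x to be a string,
# so the next line raises:
# TypeError: 'in <string>' requires string as left operand, not ShardedDDPOption
ShardedDDPOption.ZERO_DP_2 in sharded_ddp
```

With `sharded_ddp` parsed into a list of options instead of left as a string, the same membership test is valid.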

## To reproduce

Steps to reproduce the behavior:

  1. Code:

```python
import logging

import os
from statistics import mean, stdev
import sys
from typing import Callable, Dict
import pandas as pd

import numpy as np
from pprint import pformat
from scipy.special import softmax
import tensorboard
import torch

from transformers import (
    AutoTokenizer,
    AutoConfig,
    HfArgumentParser,
    Trainer,
    EvalPrediction,
    set_seed
)

from multimodal_args import ModelArguments, MultiModalDataArguments, MultiModalTrainingArguments
from evaluation import calc_classification_metrics, calc_regression_metrics
from load_dataset import load_datadir
from config import TabularConfig
from auto_fusion_model import AutoModelFusion
from utils import create_dir_if_not_exists

os.environ['COMET_MODE'] = 'DISABLED'
logger = logging.getLogger(__name__)

def main():

    # Define text and tabular features
    text_cols = ['keywords', "browse_node_name", "pod", "ORDERING_GL_PRODUCT_GROUP", "gl_product_group_desc"]
    label_col = 'label'
    cat_features = []
    non_num_col = text_cols + ["shipping_address_id", "postal_code", "browse_node_id", "label", "asin", "customer_id", "order_day"]
    # features = pd.read_csv("/efs/avimodi/static_model/feature_importance_static_rhm.csv")
    # features_list = features.head(50)["Feature"].to_list()
    logger.info("Reading sample file")
    sample = pd.read_csv("/efs/avimodi/.MultiModal_Model/input_sample/val.csv")
    features_list = sample.columns.to_list()
    num_features = [col for col in features_list if col not in non_num_col]
    logger.info(len(num_features))
    label_list = ["0", "1", "2"]  # what each label class represents
    column_info_dict = {
        'text_cols': text_cols,
        'num_cols': num_features,
        'cat_cols': cat_features,
        'label_col': 'label',
        'label_list': ["0", "1", "2"]
    }


    model_args = ModelArguments(
        model_name_or_path='bert-base-uncased'
    )

    data_args = MultiModalDataArguments(
        data_path='/efs/avimodi/.MultiModal_Model/input_sample',
        fusion_method='attention',
        features_info=column_info_dict,
        task='classification',
        numerical_encoding='min-max',
        categorical_encoding='none'
    )
    task = data_args.task  # `task` is used in the file names below but was never defined

    training_args = MultiModalTrainingArguments(
        output_dir="/efs/avimodi/unified_model/run_sample/output",
        logging_dir="/efs/avimodi/unified_model/run_sample/logs",
        overwrite_output_dir=True,
        do_train=True,
        do_eval=True,
        per_device_train_batch_size=256,
        per_device_eval_batch_size=256,
        num_train_epochs=10,
        evaluate_during_training=True,
        logging_steps=25,
        eval_steps=500,
        save_steps=500,
        debug_dataset=True,
        report_to=["tensorboard"],
    )

    set_seed(training_args.seed)

    # Set up logging
    create_dir_if_not_exists(training_args.output_dir)
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
        datefmt="%m/%d/%Y %H:%M:%S",
        filename=os.path.join(training_args.output_dir, 'train_log.txt'),
        filemode='w+'
    )




logger.info(f"======== Model Args ========\n{(model_args)}\n")
logger.info(f"======== Data Args ========\n{(data_args)}\n")
logger.info(f"======== Training Args ========\n{(training_args)}\n")


    tokenizer = AutoTokenizer.from_pretrained(
        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
    )


    train_dataset, val_dataset, test_dataset = load_datadir(
        data_args.data_path,
        data_args.features_info['text_cols'],
        tokenizer,
        label_col=data_args.features_info['label_col'],
        label_list=data_args.features_info['label_list'],
        categorical_cols=data_args.features_info['cat_cols'],
        numerical_cols=data_args.features_info['num_cols'],
        categorical_encoding=data_args.categorical_encoding,
        numerical_encoding=data_args.numerical_encoding,
        sep_text_token_str=tokenizer.sep_token,
        max_token_length=training_args.max_token_length,
        debug=training_args.debug_dataset
    )
    train_datasets = [train_dataset]
    val_datasets = [val_dataset]
    test_datasets = [test_dataset]

    train_dataset = train_datasets[0]

    num_labels = len(np.unique(train_dataset.labels)) if data_args.num_classes == -1 else data_args.num_classes

    def compute_metrics_fn(p: EvalPrediction):
        if data_args.task == "classification":
            preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
            preds_labels = np.argmax(preds, axis=1)
            if p.predictions.shape[-1] == 2:
                pred_scores = softmax(preds, axis=1)[:, 1]
            else:
                pred_scores = softmax(preds, axis=1)
            return calc_classification_metrics(pred_scores, preds_labels, p.label_ids)
        elif data_args.task == "regression":
            preds = np.squeeze(p.predictions)
            return calc_regression_metrics(preds, p.label_ids)
        else:
            return {}

    total_results = []
    for i, (train_dataset, val_dataset, test_dataset) in enumerate(zip(train_datasets, val_datasets, test_datasets)):
        logger.info(f'======== Fold {i+1} ========')
        config = AutoConfig.from_pretrained(
            model_args.config_name if model_args.config_name else model_args.model_name_or_path,
            cache_dir=model_args.cache_dir,
        )
        tabular_config = TabularConfig(
            num_labels=num_labels,
            cat_feat_dim=train_dataset.cat_feats.shape[1] if train_dataset.cat_feats is not None else 0,
            numerical_feat_dim=train_dataset.numerical_feats.shape[1] if train_dataset.numerical_feats is not None else 0,
            **vars(data_args)
        )
        config.tabular_config = tabular_config

        model = AutoModelFusion.from_pretrained(
            model_args.config_name if model_args.config_name else model_args.model_name_or_path,
            config=config,
            cache_dir=model_args.cache_dir
        )
        if i == 0:
            logger.info(tabular_config)
            logger.info(model)

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            compute_metrics=compute_metrics_fn
        )
        if training_args.do_train:
            train_result = trainer.train(
                resume_from_checkpoint=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
            )
            metrics = train_result.metrics
            # max_train_samples = (
            #     data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
            # )
            metrics["train_samples"] = 500 if training_args.debug_dataset else len(train_dataset)
            trainer.save_model()  # Saves the tokenizer too for easy upload
            trainer.log_metrics("train", metrics)
            trainer.save_metrics("train", metrics)
            trainer.save_state()

        # Evaluation
        eval_results = {}
        if training_args.do_eval:
            logger.info("*** Evaluate ***")
            eval_result = trainer.evaluate(eval_dataset=val_dataset)
            logger.info(pformat(eval_result, indent=4))

            output_eval_file = os.path.join(
                training_args.output_dir, f"eval_metric_results_{task}_fold_{i+1}.txt"
            )
            if trainer.is_world_master():
                with open(output_eval_file, "w") as writer:
                    logger.info("***** Eval results {} *****".format(task))
                    for key, value in eval_result.items():
                        logger.info("  %s = %s", key, value)
                        writer.write("%s = %s\n" % (key, value))

            eval_results.update(eval_result)

        if training_args.do_predict:
            logging.info("*** Test ***")

            predictions = trainer.predict(test_dataset=test_dataset).predictions
            output_test_file = os.path.join(
                training_args.output_dir, f"test_results_{task}_fold_{i+1}.txt"
            )
            eval_result = trainer.evaluate(eval_dataset=test_dataset)
            logger.info(pformat(eval_result, indent=4))
            if trainer.is_world_master():
                with open(output_test_file, "w") as writer:
                    logger.info("***** Test results {} *****".format(task))
                    writer.write("index\tprediction\n")
                    if task == "classification":
                        predictions = np.argmax(predictions, axis=1)
                    for index, item in enumerate(predictions):
                        if task == "regression":
                            writer.write("%d\t%3.3f\t%d\n" % (index, item, test_dataset.labels[index]))
                        else:
                            item = test_dataset.get_labels()[item]
                            writer.write("%d\t%s\n" % (index, item))
                output_test_file = os.path.join(
                    training_args.output_dir, f"test_metric_results_{task}_fold_{i+1}.txt"
                )
                with open(output_test_file, "w") as writer:
                    logger.info("***** Test results {} *****".format(task))
                    for key, value in eval_result.items():
                        logger.info("  %s = %s", key, value)
                        writer.write("%s = %s\n" % (key, value))
                eval_results.update(eval_result)
        del model
        del config
        del tabular_config
        del trainer
        torch.cuda.empty_cache()
        total_results.append(eval_results)

    aggr_res = aggregate_results(total_results)
    logger.info('========= Aggr Results ========')
    logger.info(pformat(aggr_res, indent=4))

    output_aggre_test_file = os.path.join(
        training_args.output_dir, f"all_test_metric_results_{task}.txt"
    )
    with open(output_aggre_test_file, "w") as writer:
        logger.info("***** Aggr results {} *****".format(task))
        for key, value in aggr_res.items():
            logger.info("  %s = %s", key, value)
            writer.write("%s = %s\n" % (key, value))

def aggregate_results(total_test_results):
    metric_keys = list(total_test_results[0].keys())
    aggr_results = dict()

    for metric_name in metric_keys:
        if type(total_test_results[0][metric_name]) is str:
            continue
        res_list = []
        for results in total_test_results:
            res_list.append(results[metric_name])
        if len(res_list) == 1:
            metric_avg = res_list[0]
            metric_stdev = 0
        else:
            metric_avg = mean(res_list)
            metric_stdev = stdev(res_list)

        aggr_results[metric_name + '_mean'] = metric_avg
        aggr_results[metric_name + '_stdev'] = metric_stdev
    return aggr_results


if __name__ == '__main__':
    main()
```

  2. Error:

```
Traceback (most recent call last):
  File "run.py", line 289, in <module>
    main()
  File "run.py", line 191, in main
    trainer.save_model()  # Saves the tokenizer too for easy upload
  File "/home/avimodi/anaconda3/envs/chakanik_transformer/lib/python3.6/site-packages/transformers/trainer.py", line 1608, in save_model
    ShardedDDPOption.ZERO_DP_2 in self.args.sharded_ddp or ShardedDDPOption.ZERO_DP_3 in self.args.sharded_ddp
TypeError: 'in <string>' requires string as left operand, not ShardedDDPOption
Killing subprocess 122966
Traceback (most recent call last):
  File "/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/avimodi/anaconda3/envs/chakanik_transformer/lib/python3.6/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/avimodi/anaconda3/envs/chakanik_transformer/lib/python3.6/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/lib/python3.6/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
```


## Expected behavior

`trainer.save_model()` should save the model (and tokenizer) without raising a `TypeError`.
sgugger (Collaborator) commented Jan 24, 2022

The error comes from the value you have in args.sharded_ddp where args is your MultiModalTrainingArguments object. Since you did not share the code of that class, there is little we can do to help fix the issue.
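For anyone hitting this later: in stock `TrainingArguments`, `__post_init__` converts the `sharded_ddp` string into a list of `ShardedDDPOption` members, which is what `Trainer.save_model` tests membership against. Below is a hedged sketch of the likely fix, assuming `MultiModalTrainingArguments` is a dataclass subclassing `TrainingArguments`; the extra fields are guesses inferred from the script above:

```python
from dataclasses import dataclass, field
from transformers import TrainingArguments

@dataclass
class MultiModalTrainingArguments(TrainingArguments):
    # hypothetical extra fields inferred from the issue's script
    debug_dataset: bool = field(default=False)
    max_token_length: int = field(default=512)

    def __post_init__(self):
        # If a subclass defines __post_init__ without this call, the parent's
        # post-processing never runs and self.sharded_ddp stays a plain string,
        # which makes `ShardedDDPOption.ZERO_DP_2 in args.sharded_ddp` raise.
        super().__post_init__()
```

If the subclass already chains to `super().__post_init__()`, the other thing to check is code that reassigns `args.sharded_ddp` back to a string after construction.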

Also, please use the forums to debug your code as we keep the issues for bugs and feature requests only :-)

github-actions (bot) commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Mar 3, 2022