[trainer] seq2seq doesn't handle mt5 correctly #9865
Comments
OK, I can reproduce the problem with just google/mt5-small and 2 gpus:
We will get it sorted out today.
OK, the problem had nothing to do with DeepSpeed; it's just a seq2seq oversight. The fix is:
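A minimal sketch of the likely shape of the fix, assuming the crash came from the `freeze_embeds` helper in the seq2seq example only special-casing the `t5` model type (see #9879 for the actual change):

```python
# Sketch only -- not necessarily the exact patch. The assumption is that
# freeze_embeds() dispatched on config.model_type and "mt5" fell through
# to the BART-style branch, which accesses attributes mt5 doesn't have.

def freeze_params(module):
    """Disable gradient updates for all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = False

def freeze_embeds(model):
    """Freeze token embeddings; mt5 shares t5's layout, so treat it the same."""
    model_type = model.config.model_type
    if model_type in ("t5", "mt5"):            # previously: == "t5"
        freeze_params(model.shared)
        for block in (model.encoder, model.decoder):
            freeze_params(block.embed_tokens)
    else:
        freeze_params(model.model.shared)
        for block in (model.model.encoder, model.model.decoder):
            freeze_params(block.embed_positions)
            freeze_params(block.embed_tokens)
```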
Please let me know if you can manage to apply this fix. I will make a proper PR later, but it'll take some work, since I need to make a tiny mt5 model and add a test. You can just edit the file if you don't know how to apply a patch.
The fix should be merged shortly: #9879
I can solve the `--freeze_embeds` problem with this fix. As for questions 3 and 4, I noticed that the title of the issue has been edited. I don't know if these questions are caused by the model or by the seq2seq trainer. Maybe I should raise them in a new issue?
Oh, you wrote those items as steps to reproduce the problem, so I didn't know that they were issues that needed to (or could) be fixed. Once I discovered that the issue you posted was unrelated to DeepSpeed, I took the liberty of adjusting the subject. In general, yes, let's try to keep each issue separate; that makes it much easier to track things and not let them fall through the cracks. Back to your follow-up question, looking just at the params:
So the 2nd model is substantially larger, and if t5-3b fit tightly onto a 24GB card, it's not surprising that the larger model didn't. And in addition to the model params you also need to allocate memory for gradients, optimizer states, and activations.
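A rough back-of-the-envelope calculation, assuming mT5-xl has roughly 3.7B parameters and fp32 training states, shows why a 24GB card gets tight even before activations are counted:

```python
# Rough sketch; the 3.7e9 parameter count for mT5-xl is an assumption.
params = 3.7e9
fp32_bytes = 4

weights_gb = params * fp32_bytes / 2**30   # ~13.8 GB
grads_gb = weights_gb                      # fp32 gradients
adam_gb = 2 * weights_gb                   # Adam momentum + variance

print(f"weights ~{weights_gb:.1f} GB, grads ~{grads_gb:.1f} GB, optimizer ~{adam_gb:.1f} GB")
# ZeRO stage 2 shards the gradients and optimizer states across GPUs (and can
# offload them to CPU), but every GPU still holds the full set of weights.
```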
I tried mt5-xl on a 4x 40GB GPU setup and it worked, but it took ~29GB on each GPU, so there is the problem: you're 5GB short. The command I ran was:
You may try to tweak the buffer sizes in the DeepSpeed config. I'm working on a 2D parallelism solution that will combine pipe|model parallelism with ZeRO-DP (DeepSpeed), which should enable such feats with huge models, but it might take some time. The docs aren't quite there, so it takes a lot of trial and error to move forward. You may want to track PR #9765 for updates. Alternatively, when fairscale or DeepSpeed releases ZeRO stage 3, you shouldn't have a problem loading this model onto 4x 24GB GPUs. Currently the problem is that the model params are too big without stage 3; in stage 3 the params are partitioned too, problem solved.
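The buffer sizes in question are presumably the ZeRO bucket sizes in the DeepSpeed configuration; here is a sketch (expressed as a Python dict, with purely illustrative values) of the kind of knobs one might reduce:

```python
# Illustrative only: smaller bucket sizes lower peak GPU memory at the cost of
# some communication efficiency. The values below are not recommendations.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,
        "allgather_bucket_size": 2e8,   # shrink from the default to free memory
        "reduce_bucket_size": 2e8,
    },
    "train_micro_batch_size_per_gpu": 1,
}
```

Once ZeRO stage 3 is available, this kind of tuning matters less for fitting the model, since the parameters themselves are partitioned across GPUs as well.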
That helps a lot! Thank you! I am also looking forward to ZeRO stage 3 and your pipe|model parallelism. I hope one day we can work on it. Thank you again!
Did you get
mT5-xl is actually quite a bit bigger than T5-3b, for two reasons:
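One concrete, checkable difference is the vocabulary size (roughly 250k tokens for mT5 vs roughly 32k for T5), which alone makes mT5's embedding matrices far larger. A small sketch to compare the two configs, assuming both checkpoints are fetched from the Hub:

```python
from transformers import AutoConfig

# Compare the model configs without downloading any weights; mT5's much larger
# vocabulary dominates the difference in embedding size.
for name in ("t5-3b", "google/mt5-xl"):
    cfg = AutoConfig.from_pretrained(name)
    print(name, "vocab_size:", cfg.vocab_size, "d_model:", cfg.d_model)
```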
@patil-suraj That's very helpful, thank you a lot! Now I understand that there are many differences between mT5-xl and T5-3b, and I will set up separate experiments for them in the future. By the way, do you have any plans to fix fp16 in mt5-large/xl?
Dear @patil-suraj, here you mentioned that you made mt5-small work with fp16? Since you did not mention this model, do you mind telling me how you made it work? I am having a hard time with mt5-small and fp16. Thanks a lot for your advice.
I have a similar error here:

```python
from transformers import T5TokenizerFast, MT5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained('google/mt5-base')  # "google/mt5-base" "google/mt5-large" "google/mt5-xl"
model = MT5ForConditionalGeneration.from_pretrained('google/mt5-base', return_dict=True)

condition = "translate English to German: "
input = "My name is Azeem and I live in India"
# You can also use "translate English to French" and "translate English to Romanian"
input_ids = tokenizer(condition + input, return_tensors="pt").input_ids  # Batch size 1

outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)
```

Stacktrace:
@stas00 any idea? I'm using HF master:
Environment info

- transformers version: 4.2.2

Who can help

@stas00, @patrickvonplaten, @patil-suraj

Information

Model I am using (MT5-xl, MT5-large):

The problem arises when using:
The tasks I am working on is:

To reproduce

Steps to reproduce the behavior:

- I used examples/seq2seq/finetune_trainer.py, which was originally used to reproduce the training of T5-3b on a single 3090. All processes are the same as in #8771, and it can reproduce the training of T5-3b (whether on a single card or on 2/4 cards).
- --freeze_embeds seems to bring bugs. I used 4x 3090. My script is:
  Here is my report:
- Without --freeze_embeds I tried to train MT5-xl again, but I got CUDA out of memory. My device is 4x 24GB 3090, with BS=1, ZeRO stage=2, and CPU_offload=true. I assume that T5-3b and MT5-xl should be in the same order of magnitude, and since I can do it with t5-3b, I think this should not happen.

Expected behavior